HiCExplorer

Set of programs to process, normalize, analyze and visualize Hi-C and cHi-C data

HiCExplorer addresses the common tasks of Hi-C data analysis from processing to visualization.

_images/hicex3.png

Availability

HiCExplorer is available as a command line suite of tools on this GitHub repository.

A Galaxy HiCExplorer version is directly available to users at http://hicexplorer.usegalaxy.eu. Training material is available at the Galaxy Training Network, while a Galaxy Tour is available here for users not familiar with this platform. Galaxy HiCExplorer is also available as a Docker image at the Docker Galaxy HiCExplorer GitHub repository. Finally, this Galaxy version is available on the Galaxy Tool Shed and on the corresponding GitHub repository.

The following is the list of tools available in HiCExplorer

tool

description

hicFindRestSites

Identifies the genomic locations of restriction sites

hicBuildMatrix

Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads

hicQuickQC

Estimates the quality of Hi-C dataset

hicQC

Plots QC measures from the output of hicBuildMatrix

hicCorrectMatrix

Uses iterative correction to remove biases from a Hi-C matrix

hicDetectLoops

Identifies enriched Hi-C contacts

hicCorrelate

Computes and visualizes the correlation of Hi-C matrices

hicFindTADs

Identifies Topologically Associating Domains (TADs)

hicMergeDomains

Merges TADs of different resolutions

hicDifferentialTAD

Computes differential TADs between two Hi-C samples

hicPCA

Computes for A / B compartments the eigenvectors

hicTransform

Computes a obs_exp matrix like Lieberman-Aiden (2009), a pearson correlation matrix and or a covariance matrix. These matrices can be used for plotting.

hicMergeMatrixBins

Merges consecutive bins on a Hi-C matrix to reduce resolution

hicMergeTADbins

Uses a BED file of domains or TAD boundaries to merge the bin counts of a Hi-C matrix.

hicPlotDistVsCounts

Plot the decay in interaction frequency with distance

hicPlotMatrix

Plots a Hi-C matrix as a heatmap

hicPlotTADs

Plots TADs as a track that can be combined with other tracks (genes, signal, interactions)

hicPlotViewpoint

A plot with the interactions around a reference point or region.

hicAggregateContacts

A tool that allows plotting of aggregated Hi-C sub-matrices of a specified list of positions.

hicSumMatrices

Adds Hi-C matrices of the same size

hicPlotDistVsCounts

Plots distance vs. Hi-C counts of corrected data

hicInfo

Shows information about a Hi-C matrix file (no. of bins, bin length, sum, max, min, etc)

hicCompareMatrices

Computes difference or ratio between two matrices

hicAverageRegions

Computes the average of multiple given regions, usually TAD regions

hicPlotAverageRegions

visualization of hicAverageRegions

hicNormalize

Normalizes the given matrices to 0-1 range or the smallest read coverage

hicConvertFormat

Converts between different Hi-C interaction matrices

hicAdjustMatrix

Keeps, removes or masks regions in a Hi-C matrix

hicValidateLocations

Compare the loops with known peak protein locations

hicMergeLoops

Merges loops of different resolutions

hicCompartmentalization

Compute the global compartmentalization signal

chicQualityControl

Quality control for cHi-C data

chicViewpointBackgroundModel

Background model computation for cHi-C analysis

chicViewpoint

Computation of all viewpoints based on background model for cHi-C analysis

chicSignificantInteractions

Detection of significant interactions per viewpoint based on background model

chicAggregateStatistic

Compiling of target regions for two samples as input for differential analysis

chicDifferentialTest

Differential analysis of interactions of two samples

chicPlotViewpoint

Plotting of viewpoint with background model and highlighting of significant and differential regions

hicPlotSVL

Computing short vs long range contacts and plotting the results

hicHyperoptDetectLoops

Search for optimal hicDectectLoops parameters

hicHyperoptDetectLoopsHiCCUPS

Search for optimal Juicer HiCCUPS parameters

Getting Help

  • For all kind of questions, suggesting changes/enhancements and to report bugs, please create an issue on our GitHub repository

  • In the past we offered to post on Biostars with Tag hicexplorer : Biostars or on the deepTools mailing list. We still check these resources from time to time but the preferred way to communicate are GitHub issues.

Contents:

Installation

Requirements

  • python >= 3.6

  • numpy >= 1.19.*

  • scipy >= 1.5.*

  • matplotlib-base >= 3.1.*

  • ipykernel >= 5.3.0

  • pysam >= 0.16.*

  • intervaltree >= 3.1.*

  • biopython < 1.77

  • pytables >= 3.6.*

  • pandas >= 1.1.*

  • pybigwig >= 0.3.*

  • jinja2 >= 2.11

  • unidecode >= 1.1.*

  • hicmatrix >= 15

  • hic2cool >= 0.8.3

  • psutil >= 5.7.*

  • pygenometracks >= 3.5

  • fit_nbinom >= 1.1

  • cooler >= 0.8.10

  • krbalancing >= 0.0.5 (Needs the library eigen; openmp is recommended for linux users. No openmp support on macOS.)

  • pybedtools >= 0.8.*

  • future >= 0.18

  • tqdm >= 4.50

  • hyperopt >= 0.2.4

  • python-graphviz >= 0.14

Warning: Python 2.7 support is discontinued. Moreover, the support for pip is discontinued too. Warning: We strongly recommend to use the conda package manager and will no longer give support on all issues raising with pip.

Command line installation using conda

The fastet way to obtain Python 3.6 or 3.7 together with numpy and scipy is via the Anaconda Scientific Python Distribution. Just download the version that’s suitable for your operating system and follow the directions for its installation. All of the requirements for HiCExplorer can be installed in Anaconda with:

$ conda install hicexplorer -c bioconda -c conda-forge

We strongly recommended to use conda to install HiCExplorer.

Command line installation using pip

The installation via pip is discontinued with version 3.0. The reason for this is that we want to provide a ‘one-click’ installation. However, with version 3.0 we added the C++ library eigen as dependency and pip does not support non-Python packages.

For older versions you can still use pip: Install HiCExplorer using the following command:

$ pip install hicexplorer

All python requirements should be automatically installed.

If you need to specify a specific path for the installation of the tools, make use of pip install’s numerous options:

$ pip install --install-option="--prefix=/MyPath/Tools/hicexplorer" git+https://github.com/deeptools/HiCExplorer.git

Warning: It can be that you have to install additional packages via your system package manager to successfully install HiCExplorer via pip. Warning: We strongly recommend to use the conda package manager and will no longer give support on all issues raising with pip.

Command line installation without pip

You are highly recommended to use pip rather than these more complicated steps.

  1. Install the requirements listed above in the “requirements” section. This is done automatically by pip.

2. Download source code

$ git clone https://github.com/deeptools/HiCExplorer.git

or if you want a particular release, choose one from https://github.com/deeptools/HiCExplorer/releases:

$ wget https://github.com/deeptools/HiCExplorer/archive/1.5.12.tar.gz
$ tar -xzvf

3. To install the source code (if you don’t have root permission, you can set a specific folder using the --prefix option)

$ python setup.py install --prefix /User/Tools/hicexplorer

Galaxy installation

HiCExplorer can be easily integrated into a local Galaxy, the wrappers are provided at the Galaxy tool shed.

Installation with Docker

The HiCExplorer Galaxy instance is also available as a docker container, for those wishing to use the Galaxy framework but who also prefer a virtualized solution. This container is quite simple to install:

$ sudo docker pull quay.io/bgruening/galaxy-hicexplorer

To start and otherwise modify this container, please see the instructions on the docker-galaxy-stable github repository. Note that you must use bgruening/galaxy-hicexplorer in place of bgruening/galaxy-stable in the examples, as the HiCExplorer Galaxy container is built on top of the galaxy-stable container.

Tip

For support, or feature requests contact: deeptools@googlegroups.com

HiCExplorer tools

Tools for Hi-C data pre-processing

hicFindRestSites

Identifies the genomic locations of restriction sites.

usage: hicFindRestSite --fasta mm10.fa --searchPattern AAGCTT -o rest_site_positions.bed
Required arguments
--fasta, -f

Path to fasta file for the organism genome.

--searchPattern, -p

Search pattern. For example, for HindIII this pattern is “AAGCTT”. Both, forward and reverse strand are searched for a match. The pattern is a regexp and can contain regexp specif syntax (see https://docs.python.org/2/library/re.html). For example the patternCG..GC will find all occurrence of CG followed by any two bases and then GC.

--outFile, -o

Name for the resulting bed file.

Optional arguments
--version

show program’s version number and exit

Further usage

In case multiple restriction enzymes are used in one experiment, hicFindRestSite can be used to find restriction sites individually per enzyme. Afterwards, all output bed files should be combined. However, it should be noted that the QC report will not be correct for this specific usage.

hicBuildMatrix

Using an alignment from a program that supports local alignment (eg. Bowtie2) where both PE reads are mapped using the –local option, this program reads such file and creates a matrix of interactions.

usage: hicBuildMatrix --samFiles two sam files two sam files --outFileName
                      FILENAME --QCfolder FOLDER --restrictionCutFile BED file
                      [BED file ...] --restrictionSequence RESTRICTIONSEQUENCE
                      [RESTRICTIONSEQUENCE ...] --danglingSequence
                      DANGLINGSEQUENCE [DANGLINGSEQUENCE ...]
                      [--outBam bam file] [--binSize BINSIZE [BINSIZE ...]]
                      [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE]
                      [--maxLibraryInsertSize MAXLIBRARYINSERTSIZE]
                      [--genomeAssembly GENOMEASSEMBLY]
                      [--region CHR:START-END]
                      [--removeSelfLigation REMOVESELFLIGATION]
                      [--keepSelfCircles]
                      [--minMappingQuality MINMAPPINGQUALITY]
                      [--threads THREADS] [--inputBufferSize INPUTBUFFERSIZE]
                      [--doTestRun] [--doTestRunLines DOTESTRUNLINES]
                      [--skipDuplicationCheck] [--chromosomeSizes txt file]
                      [--help] [--version]
Required arguments
--samFiles, -s

The two PE alignment sam files to process

--outFileName, -o

Output file name for the Hi-C matrix.

--QCfolder

Path of folder to save the quality control data for the matrix. The log files produced this way can be loaded into hicQC in order to compare the quality of multiple Hi-C libraries.

--restrictionCutFile, -rs

BED file(s) with all restriction cut sites (output of “hicFindRestSite” command). Should only contain the restriction sites of the same genome which has been used to generate the input sam files. Using regions of a different genome version can generate false results! To use more than one restriction enzyme, generate a restrictionCutFile for each enzyne and list them space seperated.

--restrictionSequence, -seq

Sequence of the restriction site, if multiple are used, please list them space seperated. If a dangling sequence is listed at the same time, please preserve the same order.

--danglingSequence

Sequence left by the restriction enzyme after cutting, if multiple are used, please list them space seperated and preserve the order. Each restriction enzyme recognizes a different DNA sequence and, after cutting, they leave behind a specific “sticky” end or dangling end sequence. For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT. For DpnII, the restriction site and dangling end sequence are the same: GATC. This information is easily found on the description of the restriction enzyme. The dangling sequence is used to classify and report reads whose 5’ end starts with such sequence as dangling-end reads. A significant portion of dangling-end reads in a sample are indicative of a problem with the re-ligation step of the protocol.

Optional arguments
--outBam, -b

Output bam file to process. Optional parameter. A bam file containing all valid Hi-C reads can be created using this option. This bam file could be useful to inspect the distribution of valid Hi-C reads pairs or for other downstream analyses, but is not used by any HiCExplorer tool. Computation will be significantly longer if this option is set.

--binSize, -bs

Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. If not given, matrices of restriction site resolution will be built. Optionally for mcool file format: Define multiple resolutions which are all a multiple of the first value. Example: –binSize 10000 20000 50000 will create a mcool file formate containing the three defined resolutions.

--minDistance

Minimum distance between restriction sites. Restriction sites that are closer than this distance are merged into one. This option only applies if –restrictionCutFile is given.

Default: 300

--maxDistance

This parameter is now obsolete. Use –maxLibraryInsertSize instead

--maxLibraryInsertSize

The maximum library insert size defines different cut offs based on the maximum expected library size. This is not the average fragment size but the higher end of the the fragment size distribution (obtained using for example a Fragment Analyzer or a Bioanalyzer) which usually is between 800 to 1500 bp. If this value is not known use the default of 1000. The insert value is used to decide if two mates belong to the same fragment (by checking if they are within this max insert size) and to decide if a mate is too far away from the nearest restriction site.

Default: 1000

--genomeAssembly, -ga

The genome the reads were mapped to. Used for metadata of cool file.

--region, -r

Region of the genome to limit the operation to. The format is chr:start-end. It is also possible to just specify a chromosome, for example –region chr10

--removeSelfLigation

If set, inward facing reads less than 1000 bp apart and having a restrictionsite in between are removed. Although this reads do not contribute to any distant contact, they are useful to account for bias in the data.

Default: True

--keepSelfCircles

If set, outward facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots.

Default: False

--minMappingQuality

minimum mapping quality for reads to be accepted. Because the restriction enzyme site could be located on top of the read, this may reduce the reported quality of the read. Thus, this parameter may be adjusted if too many low quality (but otherwise perfectly valid Hi-C reads) are found. A good strategy is to make a test run (using the –doTestRun), then checking the results to see if too many low quality reads are present and then using the bam file generated to check if those low quality reads are caused by the read not being mapped entirely.

Default: 15

--threads

Number of threads. Using the python multiprocessing module. One master process which is used to read the input file into the buffer and one process which is merging the output bam files of the processes into one output bam file. All other threads do the actual computation. Minimum value for the ‘–thread’ parameter is 2. The usage of 8 threads is optimal if you have an HDD. A higher number of threads is only useful if you have a fast SSD. Have in mind that the performance of hicBuildMatrix is influenced by the number of threads, the speed of your hard drive and the inputBufferSize. To clarify: the performance with a higher thread number is not negative influenced but not positive too. With a slow HDD and a high number of threads many threads will do nothing most of the time.

Default: 4

--inputBufferSize

Size of the input buffer of each thread. 400,000 read pairs per input file per thread is the default value. Reduce this value to decrease memory usage.

Default: 400000

--doTestRun

A test run is useful to test the quality of a Hi-C experiment quickly. It works by testing only 1,000,000 reads. This option is useful to get an idea of quality control values like inter-chromosomal interactions, duplication rates etc.

Default: False

--doTestRunLines

Number of lines to consider for the qc test run.

Default: 1000000

--skipDuplicationCheck

Identification of duplicated read pairs is memory consuming. Thus, in case of memory errors this check can be skipped. However, consider running a –doTestRun first to get an estimation of the duplicated reads.

Default: False

--chromosomeSizes, -cs

File with the chromosome sizes for your genome. A tab-delimited two column layout “chr_name size” is expectedUsually the sizes can be determined from the SAM/BAM input files, however, for cHi-C or scHi-C it can be that at the start or end no data is present. Please consider that this option causes that only reads are considered which are on the listed chromosomes.Use this option to guarantee fixed sizes. An example file is available via UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/dm3.chrom.sizes

--version

show program’s version number and exit

Please note that the file type extension for the output matrix (--outFileName) must be given! This can be .h5, .cool or the specializations of cool .mcool, if the path is given! For .scool files please create one .cool file per cell and merge it together with scHiCExplorer’s scHicMergeToSCool.

Building multicooler matrices

hicBuildMatrix supports building multicooler matrices which are for example needed for visualization with HiGlass. To do so, use as out file format either .cool or .mcool and define the desired resolutions as –binSize. hicBuildMatrix builds the interaction matrix for the highest resolution and merges the bins for the lower resolutions. The lower resolutions need to be an integer multiplicative of the highest resolution.

$ hicBuildMatrix -s forward.bam reverse.bam -o multi_resolution.cool
  --binSize 10000 20000 50000 100000 --QCfolder QC

Introducing with version 3.5 we support multiple restriction and dangling end sequences, and multiple restriction cut site files. Hi-C protocols that use multiple restriction cut enzymes benefit from this and get now an improved QC report. Version 3.5 adds also the support for a chromosome size file which can help to get interaction matrices with a predefined size. Capture Hi-C or single-cell Hi-C data, where it is not guaranteed that reads from all areas of the chromosome are present benefit from this latest improvement.

Missing scaffolds or contigs in the Hi-C matrix

Restriction enzymes cut the DNA at their specific restriction cut site sequences. It can occur for scaffolds or contigs (less likely it happens for chromosomes) that it does not contain any of these cut sites. In this case, the reads from this scaffold or contig are considered as invalid and are part of the ‘same restriction fragment’-statistics in the QC report.

hicSumMatrices

Adds Hi-C matrices of the same size. Format has to be hdf5 (.h5) or npz. In order to minimize the the loss of information, it is recommended to to sum uncorrected matrices (before hicCorrectMatrix).

usage: hicSumMatrices --matrices .h5 or cooler file format
                      [.h5 or cooler file format ...] --outFileName
                      OUTFILENAME [-h] [--version]
Required arguments
--matrices, -m

Space-delimited names of the matrices to add. The matrices must have the same shape/size. You can verify their size by using hicInfo.

--outFileName, -o

File name to save the resulting matrix. The output is from the same file type as the input. Please add the file ending suffix (either .h5 or .cool), if it is not given, there will be no output.

Optional arguments
--version

show program’s version number and exit

hicMergeMatrixBins
Background

Depending on the downstream analyses to perform on a Hi-C matrix generated with HiCExplorer, one might need different bin resolutions. For example using hicPlotMatrix to display chromatin interactions of a whole chromosome will not produce any meaningful visualization if it is performed on a matrix at restriction sites resolution. hicMergeMatrixBins address this issue by merging a given number of adjacent bins (determined by --numBins). To minimize the loss of information (matrix bins with few counts), it is recommended to perform hicMergeMatrixBins on uncorrected matrices (before matrix correction). The matrix correction can be performed after bin merging for downstream analyses using hicCorrectMatrix.

Description

Merges bins from a Hi-C matrix. For example, using a matrix containing 5kb bins, a matrix of 50kb bins can be derived using –numBins 10. From one type of downstream analysis to another, different bin sizes are used. For example to call TADs, unmerged matrices are recommended while to display Hi-C matrices, bins of approximately 2000bp usually yield the best representations with hicPlotMatrix for small regions, and even larger bins (50kb) are recommended for whole chromosome representations or for hicPlotDistVsCounts.

usage: hicMergeMatrixBins --matrix matrix.h5 --outFileName OUTFILENAME
                          --numBins int [--runningWindow] [--help] [--version]
Required arguments
--matrix, -m

Matrix to reduce in h5 format.

--outFileName, -o

File name to save the resulting matrix. The output is also a .h5 file. But don’t add the suffix.

--numBins, -nb

Number of bins to merge.

Optional arguments
--runningWindow

Set to merge for using a running window of length –numBins.

Default: False

--version

show program’s version number and exit

Usage example
Running hicMergeMatrixBins

Below, we display the example of plotting a Hi-C matrix at the scale of the whole X-chromosome and at the scale of a 1Mb region of the X chromosome. To do this, we will perform two different bin merging using hicMergeMatrixBins on an uncorrected matrix built at restriction site resolution using hicBuildMatrix. To do this, we run the two following commands:

$ hicMergeMatrixBins -m myMatrix.h5 -o myMatrix_merged_nb3.h5 -nb 3

$ hicMergeMatrixBins -m myMatrix.h5 -o myMatrix_merged_nb50.h5 -nb 50

Starting from a matrix myMatrix.h5 with bins of a median length of 529bp, the first command produces a matrix myMatrix_merged_nb3.h5 with bins of a median length of 1661bp while the second command produces a matrix myMatrix_merged_nb50.h5 with bins of a median length of 29798bp. This can be checked with hicInfo.

After correction of these three matrices using hicCorrectMatrix, we can now plot the corrected matrices myMatrix_corrected.h5, myMatrix_merged_nb3_corrected.h5 and myMatrix_merged_nb50_corrected.h5 at the scale of the whole X-chromosome and at the X:2000000-3000000 region to see the effect of bin merging on the interactions visualization.

Effect of bin merging at the scale of a chromosome
$ hicPlotMatrix -m myMatrix_corrected.h5 \
-o myMatrix_corrected_Xchr.png \
--chromosomeOrder X \
-t Restriction_sites_resolution --log1p \
--clearMaskedBins

$ hicPlotMatrix -m myMatrix_merged_nb3_corrected.h5 \
-o myMatrix_merged_nb3_corrected_Xchr.png \
--chromosomeOrder X \
-t Bins_merged_by_3 --log1p \
--clearMaskedBins

 $ hicPlotMatrix -m myMatrix_merged_nb50_corrected.h5 \
-o myMatrix_merged_nb50_corrected_Xchr.png \
--chromosomeOrder X \
-t Bins_merged_by_50 --log1p \
--clearMaskedBins

When observed altogether, the plots produced by these three commands show that merging of bins by 50 is the most adequate way to plot interactions for a whole chromosome in Drosophila melanogaster when starting from a matrix with bins of a median length of 529bp.

_images/hicMergeMatrixBins_Xchr.png
Effect of bin merging at the scale of a specific region
 $ hicPlotMatrix -m myMatrix_corrected.h5 \
-o myMatrix_corrected_Xregion.png \
--region X:2000000-3000000 \
-t Restriction_sites_resolution --log1p \
--clearMaskedBins

$ hicPlotMatrix -m myMatrix_merged_nb3_corrected.h5 \
-o myMatrix_merged_nb3_corrected_Xregion.png \
--region X:2000000-3000000 \
-t Bins_merged_by_3 --log1p \
--clearMaskedBins

 $ hicPlotMatrix -m myMatrix_merged_nb50_corrected.h5 \
-o myMatrix_merged_nb50_corrected_Xregion.png \
--region X:2000000-3000000 \
-t Bins_merged_by_50 --log1p \
--clearMaskedBins

When observed altogether, the plots produced by these three commands show that merging of bins by 3 is the most adequate way to plot interactions for a region of 1Mb in Drosophila melanogaster when starting from a matrix with bins of a median length of 529bp.

_images/hicMergeMatrixBins_Xregion.png
hicCorrectMatrix

This function provides 2 balancing methods which can be applied on a raw matrix.

I. KR: It balances a matrix using a fast balancing algorithm introduced by Knight and Ruiz (2012).

II. ICE: Iterative correction of a Hi-C matrix (see Imakaev et al. 2012 Nature Methods for details). For this method to work correctly, bins with zero reads assigned to them should be removed as they cannot be corrected. Also, bins with low number of reads should be removed, otherwise, during the correction step, the counts associated with those bins will be amplified (usually, zero and low coverage bins tend to contain repetitive regions). Bins with extremely high number of reads can also be removed from the correction as they may represent copy number variations.

To aid in the identification of bins with low and high read coverage, the histogram of the number of reads can be plotted together with the Median Absolute Deviation (MAD).

It is recommended to run hicCorrectMatrix as follows:

$ hicCorrectMatrix diagnostic_plot –matrix hic_matrix.h5 -o plot_file.png

Then, after revising the plot and deciding on the threshold values:

$ hicCorrectMatrix correct –correctionMethod ICE –matrix hic_matrix.h5 –filterThreshold <lower threshold> <upper threshold> -o corrected_matrix

For a more in-depth review of how to determine the threshold values, please visit: http://hicexplorer.readthedocs.io/en/latest/content/example_usage.html#correction-of-hi-c-matrix

We recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

usage: hicCorrectMatrix [-h] [--version]  ...
Named Arguments
--version

show program’s version number and exit

Options

Possible choices: diagnostic_plot, correct

To get detailed help on each of the options:

$ hicCorrectMatrix diagnostic_plot -h

$ hicCorrectMatrix correct -h

Sub-commands:
diagnostic_plot
Plots a histogram of the coverage per bin together with the

modified z-score based on the median absolute deviation method

(see Boris Iglewicz and David Hoaglin 1993, Volume 16: How to Detect and Handle Outliers The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor).

hicCorrectMatrix diagnostic_plot --matrix hic_matrix.h5 -o file.png
Required arguments
--matrix, -m

Name of the Hi-C matrix to correct in .h5 format.

--plotName, -o

File name to save the diagnostic plot.

Optional arguments
--chromosomes

List of chromosomes to be included in the iterative correction. The order of the given chromosomes will be then kept for the resulting corrected matrix.

--xMax

Max value for the x-axis in counts per bin.

--perchr

Compute histogram per chromosome. For samples from cells with uneven number of chromosomes and/or translocations it is advisable to check the histograms per chromosome to find the most conservative filterThreshold.

Default: False

--verbose

Print processing status.

Default: False

correct

Run Knight-Ruiz matrix balancing algorithm (KR) or the iterative matrix correction (ICE) .

hicCorrectMatrix correct --matrix hic_matrix.h5 --filterThreshold -1.2 5 (Only if ICE)-out corrected_matrix.h5
Required arguments
--matrix, -m

Name of the Hi-C matrix to correct in .h5 format.

--outFileName, -o

File name to save the resulting matrix. The output is a .h5 file.

Optional arguments
--correctionMethod

Method to be used for matrix correction. It can be set to KR or ICE.

Default: “KR”

--filterThreshold, -t

Removes bins of low or large coverage. Usually these bins do not contain valid Hi-C data or represent regions that accumulate reads and thus must be discarded. Use hicCorrectMatrix diagnostic_plot to identify the modified z-value thresholds. A lower and upper threshold are required separated by space, e.g. –filterThreshold -1.5 5. Applied only for ICE!

--iterNum, -n

Number of iterations to compute.only for ICE!

Default: 500

--inflationCutoff

Value corresponding to the maximum number of times a bin can be scaled up during the iterative correction. For example, an inflation cutoff of 3 will filter out all bins that were expanded 3 times or more during the iterative correctionself.Only for ICE!

--transCutoff, -transcut

Clip high counts in the top -transcut trans regions (i.e. between chromosomes). A usual value is 0.05. Only for ICE!

--sequencedCountCutoff

Each bin receives a value indicating the fraction that is covered by reads. A cutoff of 0.5 will discard all those bins that have less than half of the bin covered. Only for ICE!

--chromosomes

List of chromosomes to be included in the iterative correction. The order of the given chromosomes will be then kept for the resulting corrected matrix

--skipDiagonal, -s

If set, diagonal counts are not included. Only for ICE!

Default: False

--perchr

Normalize each chromosome separately. This is useful for samples from cells with uneven number of chromosomes and/or translocations.

Default: False

--verbose

Print processing status.

Default: False

--version

show program’s version number and exit

With HiCExplorer version 3.0 we offer an additional Hi-C interaction matrix correction algorithm: Knight-Ruiz.

$ hicCorrectMatrix correct --matrix matrix.cool --correctionMethod KR --chromosomes chrUextra chr3LHet --outFileName corrected_KR.cool

The iterative correction can be used via:

$ hicCorrectMatrix correct --matrix matrix.cool --correctionMethod ICE --chromosomes chrUextra chr3LHet --iterNum 500  --outFileName corrected_ICE.cool --filterThreshold -1.5 5.0

HiCExplorer version 3.1 changes the way data is transferred from Python to C++ for the KR correction algorithm. With these changes the following runtime and peak memory usage on Rao 2014 GM12878 primary + replicate data is possible:

  • KR on 25kb: 165 GB, 1:08 h

  • ICE on 25kb: 224 GB, 3:10 h

  • KR on 10kb: 228 GB, 1:42 h

  • ICE on 10kb: 323 GB, 4:51 h

  • KR on 1kb: 454 GB, 16:50 h

  • ICE on 1kb: >600 GB, > 2.5 d (we interrupted the computation and strongly recommend to use KR on this resolution)

For HiCExplorer versions <= 3.0 KR performs as follows:

  • KR on 25kb: 159 GB, 57:11 min

  • KR on 10kb: >980 GB, – (out of memory on 1TB node, we do not have access to a node with more memory on our cluster)

hicNormalize

Normalizes given matrices either to the smallest given read number of all matrices or to 0 - 1 range. However, it does NOT compute the contact probability.

We recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

usage: hicNormalize --matrices MATRICES [MATRICES ...] --normalize
                    {norm_range,smallest,multiplicative} --outFileName
                    FILENAME [FILENAME ...]
                    [--multiplicativeValue MULTIPLICATIVEVALUE]
                    [--setToZeroThreshold SETTOZEROTHRESHOLD] [--help]
                    [--version]
Required arguments
--matrices, -m

The matrix (or multiple matrices) to get information about. HiCExplorer supports the following file formats: h5 (native HiCExplorer format) and cool.

--normalize, -n

Possible choices: norm_range, smallest, multiplicative

Normalize to a) 0 to 1 range, b) all matrices to the lowest read count of the given matrices.

Default: “smallest”

--outFileName, -o

Output file name for the Hi-C matrix.

Optional arguments
--multiplicativeValue, -mv

show this help message and exit

Default: 1

--setToZeroThreshold, -sz

A threshold to set all values after normalization to 0 if smaller this threshold. Default value is 0 i.e. there is no effect.It is recommended to set it for the normalize mode “smallest” to 1.0. This parameter will influence the sparsity of the matrix by removing many values close to 0 in smallest normalization mode.

Default: 0.0

--version

show program’s version number and exit

Background

To be able to compare different Hi-C interaction matrices the matrices need to be normalized to a equal level of read coverage or value ranges. hicNormalize accomplish this by offering two modes: 0-1 range normalization or a read count normalization.

Usage example
Normalize to 0-1 range

norm_range mode of hicNormalize normalizes all reads to the 0 to 1 range i.e. the maximum value of the interaction matrix becomes 1 and the minimum value 0.

$ hicNormalize -m matrix.cool --normalize norm_range -o matrix_0_1_range.cool
Normalize to smallest read count

All matrices are normalized in the way the total read count of each matrix is equal to the read count of the matrix with the smallest read count of all input matrices.

Example
  • matrix.cool with a read count of 10000

  • matrix2.cool with a read count of 12010

  • matrix3.cool with a read count of 11000

In this example each entry in matrix2.cool and matrix3.cool are normalized with a factor of 12010 / 10000 respective with 11000 / 10000.

$ hicNormalize -m matrix.cool matrix2.cool matrix3.cool --normalize smallest
  -o matrix_normalized.cool matrix2_normalized.cool matrix3_normalized.cool

Tools for Hi-C QC

hicQuickQC
Background

This tool considers the first 1,000,000 reads (or user defined number) of the mapped bam files to get a quality estimate of the Hi-C data.

Description

The tool hicQuickQC considers the first n lines of two bam/sam files to get a first estimate of the quality of the data. It is highly recommended to set the restriction enzyme and dangling end parameter to get a good quality report.

usage: hicQuickQC --samFiles two sam files two sam files --QCfolder FOLDER
                  --restrictionCutFile BED file [BED file ...]
                  --restrictionSequence RESTRICTIONSEQUENCE
                  [RESTRICTIONSEQUENCE ...] --danglingSequence
                  DANGLINGSEQUENCE [DANGLINGSEQUENCE ...] [--lines LINES]
                  [--help] [--version]
Required arguments
--samFiles, -s

The two PE alignment sam files to process.

--QCfolder

Path of folder to save the quality control data of the matrix. The log files produced this way can be loaded into hicQC in order to compare the quality of multiple Hi-C libraries.

--restrictionCutFile, -rs

BED file(s) with all restriction cut places (output of “findRestSite” command). Should contain only mappable restriction sites. If given, the bins are set to match the restriction fragments (i.e. the region between one restriction site and the next). Alternatively, a fixed binSize can be defined instead. However, either binSize or restrictionCutFile must be defined. To use more than one restriction enzyme, generate for each one a restrictionCutFile and list them space seperated.

--restrictionSequence, -seq

Sequence of the restriction site, if multiple are used, please list them space seperated. If a dangling sequence is listed at the same time, please preserve the same order.

--danglingSequence

Sequence left by the restriction enzyme after cutting, if multiple are used, please list them space seperated and preserve the order. Each restriction enzyme recognizes a different DNA sequence and, after cutting, they leave behind a specific “sticky” end or dangling end sequence. For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT. For DpnII, the restriction site and dangling end sequence are the same: GATC. This information is easily found on the description of the restriction enzyme. The dangling sequence is used to classify and report reads whose 5’ end starts with such sequence as dangling-end reads. A significant portion of dangling-end reads in a sample are indicative of a problem with the re-ligation step of the protocol.

Optional arguments
--lines

Number of lines to consider for the QC test run.

Default: 1000000

--version

show program’s version number and exit

For more information see encode .

hicQC
Background

This tool can process quality control log files produced by hicBuildMatrix for multiple samples at once to generate summary tables and plots of quality control (QC) measures for all these samples. Additionally, an HTML output is generated where all summary tables and plots are displayed.

Description

Tabulates and plots QC measures from hicBuildMatrix log files within an HTML output

usage: hicQC --logfiles matrix1_QCfolder/QC.log matrix2_QCfolder/QC.log --labels "sample 1" "sample 2" --outputFolder QC_all_samples)
Required arguments
--logfiles, -l

Path to the log files to be processed

--labels

Label to assign to each log file. Each label should be separated by a space. Quote labels that contain spaces: E.g. –labels label1 “labels 2”

--outputFolder, -o

Several files with be saved under this folder: A table containing the results and a html file with several images.

Optional arguments
--dpi

Image resolution. By default high resolution png images with a 200 dpi are created.

Default: 200

--version

show program’s version number and exit

Usage example

hicBuildMatrix generates a QC.log file for each processed Hi-C sample in a folder specified in the --QCfolder argument. The quality control measures stored for each of these samples can be merged in summary tables and plots using hicQC. An example usage is:

$ hicQC --logfiles ./sample_1/QC.log ./sample_2/QC.log /sample_3/QC.log \
--labels "Sample 1" "Sample 2" "Sample 3" \
-o QC_plots

This command will generate several plots that are described in details below.

_images/pairs_sequenced.png

On the plot above, we can see how many reads were sequenced per sample (Sequenced reads), how many reads were mappable, unique and of high quality and how many reads passed all quality controls and are thus useful for further analysis (Hi-C contacts). All quality controls used for read filtering are explained below.

_images/unmappable_and_non_unique.png

The figure above contains the fraction of reads with respect to the total number of reads that did not map, that have a low quality score or that didn’t map uniquely to the genome. In our example we can see that Sample 3 has the highest fraction of Hi-C contacts. We explain the differences between the three samples on the plot below.

_images/pairs_discarded.png

This figure contains the fraction of read pairs (with respect to mappable and unique reads) that were discarded when building the Hi-C matrix. You can find the description of each category below:

Dangling ends: reads that start with the restriction site and constitute reads that were digested but not ligated. Sample 1 in our example has a high fraction of dangling ends (and thus a low proportion of Hi-C contacts). Reasons for this can be inefficient ligation or insufficient removal of danging ends during samples preparation.

Duplicated pairs: reads that have the same sequence due to PCR amplification. For example, Sample 2 was amplified too much and thus has a very high fraction of duplicated pairs.

Same fragment: read mates facing inward, separated by up to 800bp that do not have a restriction enzyme site in between. These read pairs are not valid Hi-C pairs and are thus discarded from further analyses.

Self circle: read pairs within 25kb with ‘outward’ read orientation.

Self ligation: read pairs with a restriction site in between that are within 800bp.

_images/distance.png

The figure above contains the fraction of read pairs (with respect to mappable reads) that compose inter chromosomal, intra short range (< 20kb) or intra long range (>= 20kb) contacts. Inter chromosomal reads of a wild-type sample are expected to be low. Trans-chromosomal contacts can be primarily considered as random ligation events. These would be expected to contribute to technical noise that may obscure some of the finer features in the Hi-C datasets (Nagano et al. 2015, Comparison of Hi-C results using in-solution versus in-nucleus ligation, doi: https://doi.org/10.1186/s13059-015-0753-7). As such, a high fraction of inter chromosomal reads is an indicator of low sample quality, but it can also be associated to cell cycle changes (Nagano et al. 2018, Cell-cycle dynamics of chromosomal organization at single-cell resolution, doi: https://doi.org/10.1038/nature23001).

Short range and long range contacts proportions can be associated to how the fixation is performed during Hi-C sample preparation. These two proportions also directly impact the Hi-C corrected counts versus genomic distance plots generated by hicPlotDistVsCounts.

_images/read_orientation.png

The last figure shows the fractions of the read pair types: inward, outward, left or right read pairs (with respect to mappable reads). Deviations from an equal distribution indicates problems during sample preparation.

hicCorrelate
Background

hicCorrelate is a dedicated Quality Control tool that allows the correlation of multiple Hi-C matrices at once with either a heatmap or scatter plots output.

Description

Computes pairwise correlations between Hi-C matrices data. The correlation is computed taking the values from each pair of matrices and discarding values that are zero in both matrices.Parameters that strongly affect correlations are bin size of the Hi-C matrices and the considered range. The smaller the bin size of the matrices, the finer differences you score. The –range parameter should be selected at a meaningful genomic scale according to, for example, the mean size of the TADs in the organism you work with.

usage: hicCorrelate --matrices MATRICES [MATRICES ...] [--zMin ZMIN]
                    [--zMax ZMAX] [--colorMap] [--plotFileFormat FILETYPE]
                    [--plotNumbers] [--method {pearson,spearman}] [--log1p]
                    [--labels sample1 sample2 [sample1 sample2 ...]]
                    [--range RANGE] --outFileNameHeatmap OUTFILENAMEHEATMAP
                    --outFileNameScatter OUTFILENAMESCATTER
                    [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                    [--threads THREADS] [--help] [--version]
Required arguments
--matrices, -m

Matrices to correlate (usually .h5 but other formats are allowed). hicCorrelate is better used on un-corrected matrices in order to exclude any changes introduced by the correction.

Heatmap arguments

Options for generating the correlation heatmap

--zMin, -min

Minimum value for the heatmap intensities. If not specified the value is set automatically.

--zMax, -max

Maximum value for the heatmap intensities.If not specified the value is set automatically.

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html

Default: “jet”

--plotFileFormat

Possible choices: png, pdf, svg, eps, emf

Image format type. If given, this option overrides the image format based on the plotFile ending. The available options are: png, emf, eps, pdf and svg.

--plotNumbers

If set, then the correlation number is plotted on top of the heatmap.

Default: False

Optional arguments
--method

Possible choices: pearson, spearman

Correlation method to use.

Default: “pearson”

--log1p

If set, then the log1p of the matrix values is used. This parameter has no effect for Spearman correlations but changes the output of Pearson correlation and, for the scatter plot, if set, the visualization of the values is easier.

Default: False

--labels, -l

User defined labels instead of default labels from file names. Multiple labels have to be separated by space, e.g. –labels sample1 sample2 sample3

--range

In bp with the format low_range:high_range, for example 1000000:2000000. If –range is given only counts within this range are considered. The range should be adjusted to the size of interacting domains in the genome you are working with.

--outFileNameHeatmap, -oh

File name to save the resulting heatmap plot.

--outFileNameScatter, -os

File name to save the resulting scatter plot.

--chromosomes

List of chromosomes to be included in the correlation.

--threads

Number of threads. Using the python multiprocessing module. Is only used with ‘cool’ matrix format. One master process which is used to read the input file into the buffer and one process which is merging the output bam files of the processes into one output bam file. All other threads do the actual computation.

Default: 4

--version

show program’s version number and exit

Usage example

Below, you can find a correlation example of uncorrected Hi-C matrices obtained from Drosophila melanogaster embryos, either wild-type or having one gene knocked-down by RNAi.

$ hicCorrelate -m Dmel_wt_1.h5 Dmel_wt_2.h5 Dmel_kd_1.h5 Dmel_kd_2.h5 \
--method=pearson --log1p \
--labels Dmel_wt_1 Dmel_wt_2 Dmel_kd_1 Dmel_kd_2 \
--range 5000:200000 \
--outFileNameHeatmap Dmel_heatmap --outFileNameScatter Dmel_scatterplot \
--plotFileFormat png
Heatmap _images/Dmel_heatmap.png

This example is showing a heatmap that was calculated using the Pearson correlation of un-corrected Hi-C matrices with a bin size of 6000 bp. The dendrogram indicates which samples are most similar to each other. You can see that the wild-type samples are separated from the knock-down samples. The second option we offer is calculating the Spearman correlation.

Scatter plot

Additionally, pairwise scatter plots comparing interactions between each sample can be plotted.

_images/Dmel_scatterplot.png
hicPlotDistVsCounts
Background

This tool allows a quick comparison between multiple Hi-C matrices of the Hi-C counts enrichment at different genomic ranges / distances up to whole chromosomes. Biological replicates should display the exact same distribution while samples coming from different cell-lines, treated versus untreated samples or mutant versus wild-type samples can display a different distribution at long and/or close ranges.

The results of this tool usually reflect the proportion of long-range and short-range contacts calculated in each sample by hicQC. Local TAD or contact enrichments will not impact the results computed by this tool, hicPCA is better suited for that purpose.

Description

This program creates distance vs. Hi-C counts plots. It can use several matrix files to compare them at once. If the –perchr option is given, each chromosome is plotted independently. When plotting multiple matrices, denser matrices are scaled down to match the sum of the smallest matrix.

usage: hicPlotDistVsCounts --matrices MATRICES [MATRICES ...] --plotFile file
                           name [--labels LABELS [LABELS ...]]
                           [--skipDiagonal] [--maxdepth INT bp] [--perchr]
                           [--chromosomeExclude CHROMOSOMEEXCLUDE [CHROMOSOMEEXCLUDE ...]]
                           [--outFileData OUTFILEDATA]
                           [--plotsize PLOTSIZE PLOTSIZE] [--help] [--version]
Required arguments
--matrices, -m

Hi-C normalized (corrected) matrices. Each path should be separated by a space.

--plotFile, -o

File name to save the file. The given file ending will be used to determine the image format. The available options are: .png, .emf, .eps, .pdf and .svg.

Optional arguments
--labels

Label to assign to each matrix file. Each label should be separated by a space. Quote labels that contain spaces: E.g. –labels label1 “labels 2”. If no labels are given then the file name is used.

--skipDiagonal, -s

If set, diagonal counts are not included.

Default: False

--maxdepth

Maximum distance from diagonal to use. In other words, distances up to maxDepth are computed. Default is 3 million bp.

Default: 3000000

--perchr

If given, computes and display distance versus Hi-C counts plots for each chromosome stored in the matrices passed to –matrices.

Default: False

--chromosomeExclude

Exclude the given list of chromosomes. This is useful for example to exclude the Y chromosome. The names of the chromosomes should be separated by space.

--outFileData

If given, the data underlying the plots is saved on this file.

--plotsize

Width and height of the plot (in inches). Default is 6*number of cols, 4 * number of rows. The maximum number of rows is 4. Example: –plotsize 6 5

--version

show program’s version number and exit

Usage example

hicPlotDistVsCounts should be used on corrected matrices with large bins (e.g. at least 50kb bins), otherwise the curves will be very spiky and unstable at longer ranges because of the sparseness of the contacts. The tool hicMergeMatrixBins can be used to merge matrix bins and the tool hicCorrectMatrix can be used for matrix correction before using hicPlotDistVsCounts.

hicPlotDistVsCounts -m \
condition1_sample1_50_bins_merged.h5 \
condition1_sampel2_50_bins_merged.h5 \
condition2_sample1_50_bins_merged.h5 \
condition2_sample2_50_bins_merged.h5 \
-o counts_vs_dist_50_bins_merged.png \
--labels 'Cond 1 Sample 1' 'Cond 1 Sample 2' 'Cond 2 Sample 1' 'Cond 2 Sample 2' \
--maxdepth 20000000 \
--plotsize 5 4.2
_images/counts_vs_dist_50_bins_merged.png

Here, we see that the samples of the first condition are not so well correlated, but they follow the same tendencies and are distinct from the two samples of the second condition. The later are well correlated and display enriched long-range contacts compared to the samples of the first condition.

hicInfo

Prints information about a matrix or matrices including matrix size, number of elements, sum of elements, etc. An example usage is: $ hicInfo -m matrix1.h5 matrix2.h5 matrix3.h5

usage: hicInfo --matrices MATRICES [MATRICES ...] [--outFileName OUTFILENAME]
               [--no_metadata] [--help] [--version]
Required arguments
--matrices, -m

The matrix (or multiple matrices) to get information about. HiCExplorer supports the following file formats: h5 (native HiCExplorer format) and cool.

Optional arguments
--outFileName, -o

File name to save information of the matrix instead of writing it to the bash.

--no_metadata, -nm

Do not use meta data from cooler file to display information. This method is slower and was the default until version 2.2 of HiCExplorer. H5 files always use this parameter.

Default: True

--version

show program’s version number and exit

An example output looks like this:

File:        hic_matrix.h5
WARNING: bin size is not homogeneous. Median 434
Size:        339,684
Sum: 243,779,407
Bin length:  434
Chromosomes: 2L, 2R, 3L, 3R, 4, X
Non-zero elements:   229,122,848
Minimum:     0
Maximum:     2116
NaN bins:    0

With HiCExplorer version 3 we support metadata in cooler files. The default behavior is to use the metadata:

$ hicInfo -m matrix.cool

To use the old method (and default for h5) please add the parameter:

$ hicInfo -m matrix.cool --no_metadata

Tools for Hi-C data analysis

hicCompareMatrices
Background

This tool is useful to compare two matrices in .h5 format by applying operations like difference, ratio or log2ratio after normalization. This can be used to determine the effect of a mutation compared to wild-type samples on contact enrichment, or to see TAD structure modifications near differentially expressed genes between two conditions when followed by hicPlotMatrix. It can also be used to compare two biological replicates.

Description

Takes two matrices as input, normalizes them and applies the given operation. To normalize the matrices each element is divided by the sum of the matrix.

usage: hicCompareMatrices --matrices matrix.h5 matrix.h5 --outFileName
                          OUTFILENAME [--operation {diff,ratio,log2ratio}]
                          [--help] [--version]
Required arguments
--matrices, -m

Name of the matrices in .h5 format to use, separated by a space.

--outFileName, -o

File name to save the resulting matrix. The output is also a .h5 file.

Optional arguments
--operation

Possible choices: diff, ratio, log2ratio

Operation to apply to the matrices.

Default: “log2ratio”

--version

show program’s version number and exit

Usage example

hicCompareMatrices is usually perfomed on corrected matrices (hicCorrectMatrix) with bins merged (hicMergeMatrixBins) depending on the downstream analyses to perform. Here is an example of a log2ratio comparison between M1BP Knockdown and GST cells in Drosophila melanogaster on corrected matrices with 50 bins merged (about 30kb bins).

hicCompareMatrices -m \
M1BP_KD_merge_m50_corrected.h5 \
GST_merge_rf_m50_corrected.h5 \
--operation log2ratio -o m1bp_over_gst_log2_m50.h5

This code outputs a matrix containing the normalized log2ratio values of M1BP_KD_merge_m50_corrected.h5 over GST_merge_rf_m50_corrected.h5. We can then display this matrix using hicPlotMatrix.

hicPlotMatrix -m \
m1bp_over_gst_log2_m50.h5 \
--clearMaskedBins \
--region chr2L:12,000,000-19,000,000 \
--vMin -4 --vMax 4 \
-o m1bp_over_gst_log2_m50_matrix_plot.png
_images/hicCompareMatrices_m1bp_over_gst_log2_m50_matrix_plot.png

In this plot we see that the cells with a M1BP Knockdown display a negative log2ratio compared to the wild-type. Depletion of M1BP thus show a dramatic effect on the distribution of Hi-C contacts in which short range contacts decrease (Ramirez et al. 2017, High-resolution TADs reveal DNA sequences underlying genome organization in flies, https://doi.org/10.1038/s41467-017-02525-w).

Below you can find an example of a log2ratio plot between Hi-C matrices of two biological replicates, no differences are observable which means that the replicates are well correlated.

_images/hicCompareMatrices_QC_log2_m50_matrix_plot.png
hicPCA

Computes PCA eigenvectors for a Hi-C matrix.

$ hicPCA –matrix hic_matrix.h5 -o pca1.bedgraph pca2.bedgraph

usage: hicPCA --matrix MATRIX --outputFileName OUTPUTFILENAME
              [OUTPUTFILENAME ...]
              [--numberOfEigenvectors NUMBEROFEIGENVECTORS]
              [--format {bedgraph,bigwig}]
              [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
              [--method {dist_norm,lieberman}] [--ligation_factor]
              [--extraTrack EXTRATRACK] [--histonMarkType HISTONMARKTYPE]
              [--pearsonMatrix PEARSONMATRIX] [--obsexpMatrix OBSEXPMATRIX]
              [--ignoreMaskedBins] [--help] [--version]
Required arguments
--matrix, -m

HiCExplorer matrix in h5 format.

--outputFileName, -o

File names for the result of the pca.Number of output files must match the number of computed eigenvectors.

Default: [‘pca1’, ‘pca2’]

Optional arguments
--numberOfEigenvectors, -noe

The number of eigenvectors that the PCA should compute.

Default: 2

--format, -f

Possible choices: bedgraph, bigwig

Output format. Either bedgraph or bigwig.

Default: “bigwig”

--chromosomes

List of chromosomes to be included in the correlation.

--method

Possible choices: dist_norm, lieberman

possible methods which can be used to build the obs-exp matrix.

Default: “dist_norm”

--ligation_factor

Setting this flag multiplies a scaling factor to each entry of the expected matrix to take care of the proximity ligation as has been explained in Homer software. This flag is only affective with dist_norm method and will be ignored if lieberman method is chosen.

Default: False

--extraTrack

Either a gene track or a histon mark coverage file (preferably a broad mark) is needed to decide if the values of the eigenvector need a sign flip or not.

--histonMarkType

Set it to active or inactive. This is only necessary if a histon mark coverage file is given as an extraTrack.

Default: “active”

--pearsonMatrix, -pm

Internally the input matrix is converted per chromosome to obs_exp matrix and consecutively to a Pearson matrix. Set this parameter to write the pearson matrix to a file.

--obsexpMatrix, -oem

Internally the input matrix is converted per chromosome to obs_exp matrix and consecutively to a Pearson matrix. Set this parameter to write the observe/expected matrix to a file.

--ignoreMaskedBins

Mask bins are usually set to 0. This option removes the masked bins before the PCA is computed. Attention: this will lead to empty PCA regions.

Default: False

--version

show program’s version number and exit

hicTransform

Converts the (interaction) matrix to different types of obs/exp, pearson or covariance matrix.

usage: hicTransform --matrix MATRIX --outFileName OUTFILENAME
                    [--method {obs_exp,obs_exp_lieberman,obs_exp_non_zero,pearson,covariance}]
                    [--ligation_factor]
                    [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                    [--perChromosome] [--help] [--version]
Required arguments
--matrix, -m

input file. The computation is done per chromosome.

--outFileName, -o

File name to save the exported matrix.

Optional arguments
--method, -me

Possible choices: obs_exp, obs_exp_lieberman, obs_exp_non_zero, pearson, covariance

Transformation methods to use for input matrix. Transformation is computed per chromosome.obs_exp computes the expected matrix as the sum per genomic distance j divided by maximal possible contacts: sum(diagonal(j) / number of elements in diagonal(j) obs_exp_lieberman computes the expected matrix as the sum per genomic distance j divided by the : sum(diagonal(j) / (length of chromosome - j))obs_exp_non_zero computes the expected matrix as the sum per genomic distance j divided by sum of non-zero contacts: sum(diagonal(j) / number of non-zero elements in diagonal(j)Optionally, ``–ligation_factor` can be used for this method as has been used by HOMER software. If –ligation_factor, then exp_i,j = exp_i,j * sum(row(i)) * sum(row(j)) / sum(matrix)pearson computes the Pearson correlation of the input matrix: Pearson_i,j = C_i,j / sqrt(C_i,i * C_j,j) and C is the covariance matrixcovariance computes the Covariance of the input matrix: Cov_i,j = E[M_i, M_j] - my_i * my_j where M is the input matrix and my the mean.

Default: “obs_exp”

--ligation_factor

Setting this flag, multiplies a scaling factor to each entry of the expected matrix to take care of the proximity ligation as has been explained in Homer software. This flag is only affective with obs_exp_non_zero method and will be ignored if any other obs/exp method is chosen.

Default: False

--chromosomes

List of chromosomes to be included in the computation.

--perChromosome, -pc

Each chromosome is processed individually, inter-chromosomal interactions are ignored. Option not valid for obs_exp_lieberman.

Default: False

--version

show program’s version number and exit

Background

hicTransform transforms a given input matrix into a new matrix using one of the following methods:

  • obs_exp

  • obs_exp_lieberman

  • obs_exp_non_zero

  • pearson

  • covariance

All expected values are computed per genomic distances.

Usage
$ hicTransform -m matrix.cool --method obs_exp -o obs_exp.cool

For all images data from Rao 2014 was used.

Observed / Expected

All values, including non-zero values, are used to compute the expected values per genomic distance.

\[exp_{i,j} = \frac{ \sum diagonal(|i-j|) }{|diagonal(|i-j|)|}\]
_images/obs_exp.png
Observed / Expected lieberman

The expected matrix is computed in the way as Lieberman-Aiden used it in the 2009 publication. It is quite similar to the obs/exp matrix computation.

\[exp_{i,j} = \frac{ \sum diagonal(|i-j|) } {(length\ of\ chromosome\ - |i-j|))}\]
_images/obs_exp_lieberman.png
Observed / Expected non zero

Only non-zero values are used to compute the expected values per genomic distance, i.e. only non-zero values are taken into account for the denominator.

\[exp_{i,j} = \frac{ \sum diagonal(i-j) }{ number\ of\ non-zero\ elements\ in\ diagonal(|i-j|)}\]
_images/obs_exp_norm.png

By adding the –ligation_factor flag, the expected matrix can be re-scaled in the same way as has been done by Homer software when computing observed/expected matrices with the option ‘-norm’.

\[exp_{i,j} = exp_{i,j} * \sum row(j) * \sum row(i) }{ \sum matrix }\]
_images/obs_exp_norm.png
Pearson correlation matrix
\[Pearson_{i,j} = \frac {C_{i,j} }{ \sqrt{C_{i,i} * C_{j,j} }}\]

C is the covariance matrix

_images/pearson.png _images/obs_exp_pearson.png

The first image shows the Pearson correlation on the original interaction matrix, the second one shows the Person correlation matrix on an observed/expected matrix. A consecutive computation like this is used in the A/B compartment computation.

Covariance matrix
\[Cov_{i,j} = E[M_i, M_j] - \mu_i * \mu_j\]

where M is the input matrix and \(\mu\) the mean.

hicAverageRegions

Sums Hi-C contacts around given reference points and computes their average. This tool is useful to detect differences at certain reference points as for example TAD boundaries between samples.

WARNING: This tool can only be used with fixed bin size Hi-C matrices. No guarantees how and if it works on restriction site interaction matrices.

usage: hicAverageRegions --matrix MATRIX --regions REGIONS
                         (--range RANGE RANGE | --rangeInBins RANGEINBINS RANGEINBINS)
                         --outFileName OUTFILENAME [--help]
                         [--coordinatesToBinMapping {start,center,end}]
                         [--version]
Named Arguments
--range, -ra

Range of region up- and downstream of each region to include in genomic units.

--rangeInBins, -rib

Range of region up- and downstream of each region to include in bin units.

Required arguments
--matrix, -m

The matrix to use for the average of TAD regions.

--regions, -r

BED file which stores a list of regions that are summed and averaged

--outFileName, -o

File name to save the average regions TADs matrix.

Optional arguments
--coordinatesToBinMapping, -cb

Possible choices: start, center, end

If the region contains start and end coordinates, define if the start, center (start + (end-start) / 2) or end bin should be used as start for range.This parameter is only important to set if the given start and end coordinates are not in the same bin.

Default: “start”

--version

show program’s version number and exit

hicAverageRegions takes as input a BED files with genomic positions, these are the reference points to sum up and average the regions up- and downstream of all these positions, good reference points are e.g. the borders of TADs. This can help to determine changes in the chromatin organization and TAD structure changes.

In the following example the 10kb resolution interaction matrix of Rao 2014 is used.

The first step computes the TADs for chromosome 1.

$ hicFindTADs -m GSE63525_GM12878_insitu_primary_10kb_KR.cool --outPrefix TADs
    --correctForMultipleTesting fdr --minDepth 30000 --maxDepth 100000
    --step 10000 -p 20 --chromosomes 1

Next, we use the domains.bed file of hicFindTADs to use the borders of TADs as reference points. As a range up- and downstream of each reference point 100kb are chosen.

$ hicAverageRegions -m GSE63525_GM12878_insitu_primary_10kb_KR.cool
    -r TADs_domains.bed --range 100000 100000 --outFileName primary_chr1

In a last step, the computed average region is plotted.

$ hicPlotAverageRegions -m primary_chr1.npz -o primary_chr1.png
_images/primary_chr1.png
hicDetectLoops

hicDetectLoops can detect enriched interaction regions (peaks / loops) based on a strict candidate selection, negative binomial distributions and Wilcoxon rank-sum tests.

The algorithm was mainly develop on GM12878 cells from Rao 2014 on 10kb and 5kb fixed bin size resolution.

Example usage
$ hicDetectLoops -m matrix.cool -o loops.bedgraph --maxLoopDistance 2000000 --windowSize 10 --peakWidth 6 --pValuePreselection 0.05 --pValue 0.05

The candidate selection is based on the restriction of the maximum genomic distance, here 2MB. This distance is given by Rao 2014. For each genomic distance a continuous negative binomial distribution is computed and only interaction pairs with a threshold less than --pValuePreselection are accepted. In a second step, each candidate is considered compared to its neighborhood. This neighborhood is defined by the --windowSize parameter in the x and y dimension. Per neighborhood only one candidate is considered, therefore only the candidate with the highest peak values is accepted. As a last step, the neighborhood is split into a peak and background region (parameter --peakWidth). The peakWidth can never be larger than the windowSize. However, we recommend for 10kb matrices a windowSize of 10 and a peakWidth of 6.

With version 3.5 a major revision of this tool was published. The biggest changes are:

  • The introduction of an observed/expected matrix as one of the first steps. Based on it, the other calculations compute without a distance dependent factor and the results are more accurate.

  • The testing of peak vs background region with the donut layout as proposed by HiCCUPS with Wilcoxon rank-sum test. Anderson-Darling test is removed.

  • Improving the handling of the parallelization and rewrote of the merging of candidates in one neighborhood. Results in faster execution time and less memory demand.

  • Loading only the interactions within the range of maxLoopDistance. This is possible with HiCMatrix version 13 and results in faster load time and a reduced memory peak. This improvement is only for cool matrices, h5 matrices do not profit from these changes.

hicDetectLoops has many parameters and it can be quite difficult to search for the best parameter setting for your data. With version 3.5 we introduce therefore two new tools hicHyperoptDetectLoops and hicHyperoptDetectLoopsHiCCUPS. The first one searches for the optimal parameter setting for your data based on HiCExplorer’s hicDetectLoops. However, if you want to compare the results to Juicer HiCCUPS, the second tool provides a parameter search for it. Please note that HiCCUPS and any dependency of it are not provided by HiCExplorer and must be installed by the user on its own.

The output file (´´-o loops.bedgraph``) contains the x and y position of each loop and its corresponding p-value of the statistical test.

1   120000000       122500000       1       145000000       147500000       0.001

The results can visualized using hicPlotMatrix:

$ hicPlotMatrix -m matrix.cool -o plot.png --log1p --region 1:18000000-22000000 --loops loops.bedgraph
_images/hicDetectLoops.png

Computes enriched regions (peaks) or long range contacts on the given contact matrix.

usage: hicDetectLoops --matrix MATRIX --outFileName OUTFILENAME
                      [--peakWidth PEAKWIDTH] [--windowSize WINDOWSIZE]
                      [--pValuePreselection PVALUEPRESELECTION]
                      [--peakInteractionsThreshold PEAKINTERACTIONSTHRESHOLD]
                      [--obsExpThreshold OBSEXPTHRESHOLD] [--pValue PVALUE]
                      [--maxLoopDistance MAXLOOPDISTANCE]
                      [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                      [--threads THREADS]
                      [--threadsPerChromosome THREADSPERCHROMOSOME]
                      [--expected {mean,mean_nonzero,mean_nonzero_ligation}]
                      [--help] [--version]
Required arguments
--matrix, -m

The matrix to compute the loop detection on.

--outFileName, -o

Outfile name to store the detected loops. The file will in bedgraph format.

Optional arguments
--peakWidth, -pw

The width of the peak region in bins. Default is 2. The square around the peak will include (2 * peakWidth)^2 bins.

Default: 2

--windowSize, -w

The window size for the neighborhood region the peak is located in. Default is 5. All values from this region (exclude the values from the peak region) are tested against the peak region for significant difference. The square will have the size of (2 * windowSize)^2 bins

Default: 5

--pValuePreselection, -pp

Only candidates with p-values less the given threshold will be considered as candidates. For each genomic distance a negative binomial distribution is fitted and for each pixel a p-value given by the cumulative density function is given. This does NOT influence the p-value for the neighborhood testing. Can a single value or a threshold file created by hicCreateThresholdFile.

Default: 0.1

--peakInteractionsThreshold, -pit

The minimum number of interactions a detected peaks needs to have to be considered.

Default: 10

--obsExpThreshold, -oet

The minimum number of obs/exp interactions a detected peaks needs to have to be considered.

Default: 1.5

--pValue, -p

Rejection level for Anderson-Darling or Wilcoxon-rank sum test for H0. H0 is peak region and background have the same distribution.

Default: 0.025

--maxLoopDistance

Maximum genomic distance of a loop, usually loops are within a distance of ~2MB.

Default: 2000000

--chromosomes

Chromosomes to include in the analysis. If not set, all chromosomes are included.

--threads, -t

Number of threads to use, the parallelization is implemented per chromosome.

Default: 4

--threadsPerChromosome, -tpc

Number of threads to use per parallel thread processing a chromosome. E.g. –threads = 4 and –threadsPerChromosome = 4 makes 4 * 4 = 16 threads in total.

Default: 4

--expected, -exp

Possible choices: mean, mean_nonzero, mean_nonzero_ligation

Method to compute the expected value per distance: Either the mean, the mean of non-zero values or the mean of non-zero values with ligation factor correction.

Default: “mean”

--version

show program’s version number and exit

hicValidateLocations

hicValidateLoops is a tool to compare the detect loops from hicDetectLoops (or from any other software as long as the data format is followed, see below) with known peak protein locations to validate if the computed loops do have the expected anchor points. For example, loops in mammals are usually bound by CTCF or Cohesin, therefore it is important to know if the detect loops have protein peaks at their X and Y position.

_images/loops_bonev_cavalli.png

Loops in Hi-C, graphic from Bonev & Cavalli, Nature Reviews Genetics 2016

Data format

The data format of hicDetectLoops output is:

chr_x start_x end_x chr_y start_y end_y p-value

As --protein the input of narrowPeak or broadPeak files are accepted. However, as long as the --protein input file contains chromosome, start and end in the first three columns, it should work.

This script overlaps the loop locations with protein locations to determine the accuracy of the loop detection. Loops need to have format as follows:

chr start end chr start end

The protein peaks need to be in narrowPeaks or broadPeak format.

A protein match is successfull if at the bin of the x and y location a protein peak is overlapped. A bin is assumed to have a protein if one or more protein peaks falling within the bin region. The value of the protein is not considered, only match or non-match.

usage: hicValidateLocations --data DATA --protein PROTEIN [--method {loops}]
                            --resolution RESOLUTION
                            [--outFileName OUTFILENAME] [--addChrPrefixLoops]
                            [--addChrPrefixProtein] [--help] [--version]
Required arguments
--data, -d

The loop file from hicDetectLoops. To use files from other sources, please follow ‘chr start end chr start end’ format.

--protein, -p

The protein peak file. Can be narrowPeak or broadPeak

--method, -m

Possible choices: loops

The loop file

Default: “loops”

--resolution, -r

The used resolution of the Hi-C interaction matrix.

Optional arguments
--outFileName, -o

The prefix name of the output files. Two file are written: output_matched_locations and output_statistics.First file contains all loop locations with protein location matches, second file contains statistics about this matching.

--addChrPrefixLoops, -cl

Adding a ‘chr’-prefix to chromosome name of the loops.

Default: False

--addChrPrefixProtein, -cp

Adding a ‘chr’-prefix to chromosome name of the protein.

Default: False

--version

show program’s version number and exit

hicMergeLoops

This script merges the locations of loops detected at several resolutions.

Loops in the inputFiles need to have the following format:

chr start end chr start end

Loops are merged if the x and y position of a loop overlap with the x and y position of another loop; all loops are considered as an overlap within +/- the bin size of the lowest resolution. I.e. for a loop with coordinates x and y, the overlap with all other loops is checked for (x - lowest resolution) and (y + lowest resolution). If two or more locations are to be merged, the loop at the lowest resolution is taken as the merged loop.

Example usage:

$ hicMergeLoops -i gm12878_10kb.bedgraph gm12878_5kb.bedgraph gm12878_25kb.bedgraph -o merged_result.bedgraph -r 25000

Please recall: We work with binned data i.e. the lowest resolution is therefore the one where we merge the most bases into one bin. In the above example the lowest resultion is 25kb, the highest resolution is 5kb.

usage: hicMergeLoops --inputFiles INPUTFILES [INPUTFILES ...] --outFileName
                     OUTFILENAME --lowestResolution LOWESTRESOLUTION [--help]
                     [--version]
Required arguments
--inputFiles, -i

The loop files from hicDetectLoops. To use files from other sources, please follow ‘chr start end chr start end’ format and remove any header.

--outFileName, -o

The name of the merged loop file.

--lowestResolution, -r

The lowest resolution of all loop files, i.e. 5kb, 10kb and 25kb, please use 25000.

Optional arguments
--version

show program’s version number and exit

hicHyperoptDetectLoops

usage: hicHyperoptDetectLoops --matrix MATRIX --proteinFile PROTEINFILE
                              --maximumNumberOfLoops MAXIMUMNUMBEROFLOOPS
                              [--outputFileName OUTPUTFILENAME]
                              [--resolution RESOLUTION] [--threads THREADS]
                              [--runs RUNS] [--help] [--version]
Required arguments
--matrix, -m

The matrix to compute the loops on.

--proteinFile, -p

The protein file to validate the detected loops

--maximumNumberOfLoops, -ml

The maximum number of loops that should be used for optimization computation.

--outputFileName, -o

File names for the result of the optimization.

Default: “hyperopt_result.txt”

Optional arguments
--resolution, -re

Resolution of matrix

Default: 10000

--threads, -t

Number of threads (uses the python multiprocessing module).

Default: 4

--runs, -r

Number of runs of hyperopt.

Default: 100

--version

show program’s version number and exit

hicHyperoptDetectLoopsHiCCUPS

usage: hicHyperoptDetectLoopsHiCCUPS --matrix MATRIX --proteinFile PROTEINFILE
                                     --maximumNumberOfLoops
                                     MAXIMUMNUMBEROFLOOPS --juicerPath
                                     JUICERPATH
                                     [--outputFileName OUTPUTFILENAME]
                                     [--resolution RESOLUTION] [--runs RUNS]
                                     [--threads THREADS] --normalization
                                     NORMALIZATION [--cpu] [--restricted]
                                     [--help] [--version]
Required arguments
--matrix, -m

The matrix to compute the loops on.

--proteinFile, -p

The protein file to validate the detected loops

--maximumNumberOfLoops, -ml

The maximum number of loops that should be used for optimization computation.

--juicerPath, -j

path to juicer.jar

--outputFileName, -o

File names for the result of the optimization.

Default: “hyperoptHiCCUPS_result.txt”

Optional arguments
--resolution, -r

Resolution of matrix

Default: 10000

--runs

Number of runs of hyperopt.

Default: 100

--threads, -t

Number of threads (uses the python multiprocessing module).

Default: 4

--normalization, -k

Normalization table name.

Default: “KR”

--cpu

use the CPU version

Default: False

--restricted

If the GPU version is used, search only within 8 MB.

Default: False

--version

show program’s version number and exit

hicCompartmentalization

Rearrange the average interaction frequencies using the first PC values to represent the global compartmentalization signal. To our knowledge this has been first introduced and implemented by Wibke Schwarzer et al. 2017 (Nature. 2017 Nov 2; 551(7678): 51–56)

$ hicCompartmentalization –obsexp_matrices obsExpMatrix.h5 –pca pc1.bedgraph -o global_signal.png

usage: hicCompartmentalization --obsexp_matrices OBSEXP_MATRICES
                               [OBSEXP_MATRICES ...] --pca PCA
                               --outputFileName OUTPUTFILENAME
                               [--quantile QUANTILE] [--outliers OUTLIERS]
                               [--outputMatrix OUTPUTMATRIX]
                               [--offset OFFSET [OFFSET ...]] [-h] [--version]
Required arguments
--obsexp_matrices, -m

HiCExplorer matrices in h5/cool format.

--pca

a PCA vector as a bedgraph file with no header. In case of several matrices with different conditions, ie. controltreatment, the PCA of control can be used. Note that only one PCA can be provided.

--outputFileName, -o

Plot to represent the polarization of A/B compartments.

Optional arguments
--quantile, -q

number of quantiles. (Default: 30).

Default: 30

--outliers

precentage of outlier to remove. (Default: 0).

Default: 0

--outputMatrix

output .npz file includes all the generated matrices

--offset

set nan for the distances mentioned as offset from main diagonal, only positive values are accepted! Example: if –offset 0, then values of main diagonal will set to nan and if –offset 0 1 then on top of the main diagonal, +1 and -1 diagonal values are also set to nan.

--version

show program’s version number and exit

PCA to compute the global compartmentalization signal

To our knowledge this method has been first introduced by Wibke Schwarzer et al. 2017 (Nature. 2017 Nov 2; 551(7678): 51–56). In this method, a global (genome-wide) strength for compartmentalization is computed as (AA + BB) / (AB + BA) after rearranging the bins of obs/exp based on their corresponding pc1 values. For this purpose, first pc1 values are reordered incrementally, then the same order of bins is used to rearrange the bins in obs/exp matrix.

$ _hicCompartmentalization --obsexp_matrices obsExpMatrix.h5 --pca pc1.bedgraph
  -o global_signal.png
hicPlotSVL
Description

hicPlotSVL computes the ratio between short range and long range contacts per chromosome independently. Per sample one box plot is created and, if more than one sample is given, the computed ratios are assumed to be one distribution and a Wilcoxon rank-sum test under H0 ‘distributions are equal’ is computed. All used data is written to a third raw data file.

The distance to distinct short and long range contacts can be set via the –distance parameter; for short range the sum for all contacts smaller or equal this distance are computed, for long range contacts all contacts greater this distance.

Usage example
$ hicPlotSVL -m hmec_10kb.cool nhek_10kb.cool hmec_10kb.cool --distance 2000000 --threads 4 --plotFileName plot.png --outFileName pvalues.txt --outFileNameData rawData.txt

This results in three files:

The raw data containing the sums for short range, long range and ratio per sample and chromosome.

# Created with HiCExplorer's hicPlotSVL 3.3
# Short range vs long range contacts per chromosome: raw data
# Short range contacts: <= 2000000
#   hmec_10kb.cool                  nhek_10kb.cool                  hmec_10kb.cool
# Chromosome        Ratio   Sum <= 2000000  Sum > 2000000   Ratio   Sum <= 2000000  Sum > 2000000   Ratio   Sum <= 2000000  Sum > 2000000
1   3.0399769346724543      33476834        11012200        2.79740105237572        44262902        15822866        3.0399769346724543      33476834        11012200
2   2.7532203542810625      31723954        11522490        2.5007877714355438      47468438        18981394        2.7532203542810625      31723954        11522490
3   2.922650759458664       26251027        8981924 2.6235211241878442      39640848        15109788        2.922650759458664       26251027        8981924
4   2.7235598858451637      22474680        8251950 2.5572455199457864      37486882        14659086        2.7235598858451637      22474680        8251950
5   2.9585962905193712      22716268        7678056 2.752922527526723       35445722        12875670        2.9585962905193712      22716268        7678056
6   3.168274165465025       22872690        7219290 2.8602111006131703      33990211        11883812        3.168274165465025       22872690        7219290
7   3.1093346580597516      19603416        6304698 2.8021236966788887      29712823        10603680        3.1093346580597516      19603416        6304698
8   3.135391026076832       18355087        5854162 2.7964394470859024      28660624        10248970        3.135391026076832       18355087        5854162
9   4.1147978383348125      15395763        3741560 3.819940066283481       21994046        5757694 4.1147978383348125      15395763        3741560
10  3.448063050802953       17964043        5209894 3.1116673856502253      26270171        8442474 3.448063050802953       17964043        5209894
11  3.5924666993070407      18651850        5191934 3.1364875011923035      26240350        8366158 3.5924666993070407      18651850        5191934
12  3.6817551043464416      18640866        5063038 3.306662109403207       26101554        7893626 3.6817551043464416      18640866        5063038
13  3.476204237522881       11018462        3169682 3.0976674036654805      18922281        6108558 3.476204237522881       11018462        3169682
14  3.70550850832778        11164875        3013048 3.6226817463785164      17245704        4760480 3.70550850832778        11164875        3013048
15  4.607631079612186       11165313        2423222 4.567998349104569       15273742        3343640 4.607631079612186       11165313        2423222
16  4.397874357146307       10745775        2443402 3.890983210350018       14666462        3769346 4.397874357146307       10745775        2443402
17  5.809374740402161       12168235        2094586 5.3360710927739285      14154110        2652534 5.809374740402161       12168235        2094586
18  3.7647349280938895      9339833 2480874 3.485487446356812       15019063        4309028 3.7647349280938895      9339833 2480874
19  6.492239632778196       8466283 1304062 5.774337450385819       9368978 1622520 6.492239632778196       8466283 1304062
20  5.542933774973686       8962935 1617002 4.977679877778358       12009479        2412666 5.542933774973686       8962935 1617002
21  6.665622315255486       3910374 586648  6.1843701763589225      6554715 1059884 6.665622315255486       3910374 586648
22  8.063663557923096       4992327 619114  7.433759425439728       5932928 798106  8.063663557923096       4992327 619114
X   2.208752982178897       14424173        6530460 2.3130534357407995      27628734        11944702        2.208752982178897       14424173        6530460
Y   4.165021803993573       36294   8714    3.8063291139240505      45105   11850   4.165021803993573       36294   8714
MT

The p-values between the samples:

# Created with HiCExplorer's hicPlotSVL 3.3
# Short range vs long range contacts per chromosome, p-values of each distribution against each other distribution with Wilcoxon rank-sum
# Short range contacts: <= 2000000
hmec_10kb.cool      nhek_10kb.cool  0.28362036331636575
hmec_10kb.cool      hmec_10kb.cool  1.0
nhek_10kb.cool      hmec_10kb.cool  0.28362036331636575

The box plot:

_images/plot_svl.png

Tools for TADs processing

hicFindTADs
Description

Uses a measure called TAD-separation score to identify the degree of separation between the left and right regions at each Hi-C matrix bin. This is done for a running window of different sizes. Then, TADs are called as those positions having a local TAD-separation score minimum. The TAD-separation score is measured using the z-score of the Hi-C matrix and is defined as the mean zscore of all the matrix contacts between the left and right regions (diamond).

To find the TADs, the program needs to compute first the TAD scores at different window sizes. Then, the results of that computation are used to call the TADs. This is convenient to test different filtering criteria quickly as the demanding step is the computation of TAD-separation scores.

A simple example usage is:

$ hicFindTads -m hic_matrix.h5 –outPrefix TADs –correctForMultipleTesting fdr

The bedgraph file produced by this tool can be used to plot the so-called insulation score along the genome or at specific regions. This score is much more reliable across samples than the number of TADs or the TADs width that can vary depending on the sequencing depth because of the lack of information at certain bins, and depending on the parameters used with this tool.

usage: hicFindTADs --matrix MATRIX --outPrefix OUTPREFIX
                   --correctForMultipleTesting {fdr,bonferroni,None}
                   [--minDepth INT bp] [--maxDepth INT bp] [--step INT bp]
                   [--TAD_sep_score_prefix TAD_SEP_SCORE_PREFIX]
                   [--thresholdComparisons THRESHOLDCOMPARISONS]
                   [--delta DELTA] [--minBoundaryDistance MINBOUNDARYDISTANCE]
                   [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                   [--numberOfProcessors NUMBEROFPROCESSORS] [--help]
                   [--version]
Required arguments
--matrix, -m

Corrected Hi-C matrix to use for the computations.

--outPrefix

File prefix to save the resulting files: 1. <prefix>_tad_separation.bm The format of the output file is chrom start end TAD-sep1 TAD-sep2 TAD-sep3 .. etc. We call this format a bedgraph matrix and can be plotted using hicPlotTADs. Each of the TAD-separation scores in the file corresponds to a different window length starting from –minDepth to –maxDepth. 2. <prefix>_zscore_matrix.h5, the z-score matrix used for the computation of the TAD-separation score. 3. < prefix > _boundaries.bed, which contains the positions of boundaries. The genomic coordinates in this file correspond to the resolution used. Thus, for Hi-C bins of 10.000bp the boundary position is 10.000bp long. For restriction fragment matrices the boundary position varies depending on the fragment length at the boundary. 4. <prefix>_domains.bed contains the TADs positions. This is a non-overlapping set of genomic positions. 5. <prefix>_boundaries.gff Similar to the boundaries bed file but with extra information (p-value, delta). 6. <prefix>_score.bedgraph file contains the TAD-separation score measured at each Hi-C bin coordinate. Is useful to visualize in a genome browser. The delta and p-value settings are saved as part of the name.

--correctForMultipleTesting

Possible choices: fdr, bonferroni, None

Select the bonferroni or false discovery rate for a multiple comparison. Bonferroni controls the family-wise error rate (FWER) and needs a p-value. The false discovery rate (FDR) controls the likelyhood of type I errors and needs a q-value. As a third option it is possible to not use a multiple comparison method at all.

Default: “fdr”

Optional arguments
--minDepth

Minimum window length (in bp) to be considered to the left and to the right of each Hi-C bin. This number should be at least 3 times as large as the bin size of the Hi-C matrix.

--maxDepth

Maximum window length to be considered to the left and to the right of the cut point in bp. This number should around 6-10 times as large as the bin size of the Hi-C matrix.

--step

Step size when moving from –minDepth to –maxDepth. Note, the step size grows exponentially as minDeph + (step * int(x)**1.5) for x in [0, 1, …] until it reaches maxDepth. For example, selecting step=10,000, minDepth=20,000 and maxDepth=150,000 will compute TAD-scores for window sizes: 20,000, 30,000, 40,000, 70,000 and 100,000

--TAD_sep_score_prefix

Sometimes it is useful to change some of the parameters without recomputing the z-score matrix and the TAD-separation score. For this case, the prefix containing the TAD separation score and the z-score matrix can be given. If this option is given, new boundaries will be computed but the values of –minDepth, –maxDepth and –step will not be used.

--thresholdComparisons

P-value threshold for the Bonferroni correction / q-value for FDR. The probability of a local minima to be a boundary is estimated by comparing the distribution (Wilcoxon ranksum) of the z-scores between the left and right regions (diamond) at the local minimum with the matrix z-scores for a diamond at –minDepth to the left and a diamond –minDepth to the right. If –correctForMultipleTesting is ‘None’ the threshold is applied on the raw p-values without any multiple testing correction. Set it to ‘1’ if no threshold should be used.

Default: 0.01

--delta

Minimum threshold of the difference between the TAD-separation score of a putative boundary and the mean of the TAD-sep. score of surrounding bins. The delta value reduces spurious boundaries that are shallow, which usually occur at the center of large TADs when the TAD-sep. score is flat. Higher delta threshold values produce more conservative boundary estimations. By default a value of 0.01 is used.

Default: 0.01

--minBoundaryDistance

Minimum distance between boundaries (in bp). This parameter can be used to reduce spurious boundaries caused by noise.

--chromosomes

Chromosomes and order in which the chromosomes should be plotted. This option overrides –region and –region2.

--numberOfProcessors, -p

Number of processors to use

Default: 1

--version

show program’s version number and exit

Usage example

It is recommended to test multiple parameters of TAD calling with hicFindTADs before making conclusions about the number of TADs in a given sample or before comparing TAD calling between multiple conditions. In order to compare several TAD calling parameters at once, it is recommended to use hicPlotTADs.

Below you can find a typical command-line to use hicFindTADs:

$ hicFindTADs -m myHiCmatrix.h5 \
--outPrefix myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr \
--minDepth 3000 \
--maxDepth 31500 \
--step 1500 \
--thresholdComparisons 0.05 \
--delta 0.01 \
--correctForMultipleTesting fdr \
-p 64

This command will output the following files:

myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_boundaries.bed
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_boundaries.gff
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_domains.bed
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_score.bedgraph
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_score.npz
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_tad_score.bm
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_zscore_matrix.h5

TAD boundaries locations are stored in the boundaries files, domains.bed file contains the TAD locations, score files contain TAD separation score, or the so-called TAD insulation score, in various formats. As a side note, the tad_score.bm file is a bedgraph matrix that can be used to display TAD separation score curves in hicPlotTADs for example.

The zscore_matrix.h5 file contains a z-score matrix that is useful to quickly test the –thresholdComparisons, –delta and –correctForMultipleTesting parameters by using the –TAD_sep_score_prefix option pointing to this zscore_matrix.h5 file. For example to quickly test a –thresholdComparisons of 0.01 instead of 0.05 we can run the following command:

$ hicFindTADs -m myHiCmatrix.h5 \
--outPrefix myHiCmatrix_min10000_max40000_step1500_thres0.01_delta0.01_fdr \
--TAD_sep_score_prefix myHiCmatrix_min10000_max40000_step1500_thres0.05_delta0.01_fdr
--thresholdComparisons 0.01 \
--delta 0.01 \
--correctForMultipleTesting fdr \
-p 64

As you can see above, –minDepth, –maxDepth and –step are ignored because these parameters are used to calculate the z-score matrix which is here provided to –TAD_sep_score_prefix. Since z-score matrix computation is the most demanding step of hicFindTADs in terms of memory and computation, the above command will thus run significantly faster than the previous one.

Multiple combinations of parameters can be tested that way with only one z-score matrix computation. To compare several TAD calling outputs, we use hicPlotTADs with the following command using, for example, the following tracks.ini file:

  • command line:

$ hicPlotTADs --tracks tracks.ini --region chrX:6800000-8500000  -o TAD_calling_comparison.png
  • tracks.ini:

[x-axis]
fontsize=10

[hic]
file = myHiCmatrix.h5
title = Threshold 0.05
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.05_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[hic]
file = myHiCmatrix.h5
title = Threshold 0.01
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.01_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[spacer]
height = 0.1

[hic]
file = myHiCmatrix.h5
title = Threshold 0.005
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.005_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[spacer]
height = 0.1

[hic]
file = myHiCmatrix.h5
title = Threshold 0.001
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.001_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[spacer]
height = 0.1

[bigwig]
file = /data/processing/ChIP-Seq_Embryos/H3K36me3_14c.bigwig
title = H3K36me3
color = darkred
min_value = 0
max_value = auto
height = 2
file_type = bigwig

[spacer]
height = 0.1

[genes]
file = /data/group/bedfiles/dm6/genes_sorted.bed
title = genes
color = black
height = 18
labels = true
file_type = bed

This will result in the following plot where we see that the fourth set of hicFindTADs parameters with a threshold of 0.001 gives the best results in terms of TAD calling compared to the corrected Hi-C counts distribution and compared to the enrichment of H3K36me3, which is known to be enriched at TAD boundaries in Drosophila melanogaster.

_images/hicFindTADs_TAD_calling_comparison.png
Notes

In the _domains.bed output file, the 5th column contains the TAD-separation score at the boundary located at the start of each domain.

The process to identify boundaries is as follows:

  • call all local minima in the average TAD-score. Each local minimum should be separated by at least min_boundary_distance. If this value is not given, it is set to the average bin size * 4

  • for each local minimum detected, compute its p-value and then compute a q-value.

  • for each local minimum detected, compute the ‘delta’ which is the difference between the mean TAD-score of the 10 bins before and the 10 bins after the minimum (excluding the min point)

  • keep only those minima that fulfill the following criteria: the p-value (or q-value depending on the user selection) should be below the given threshold and the delta should be above the user defined threshold

  • everything between 2 consecutive boundaries is a TAD

For computation of the p-values, the distribution of the z-scores at the ‘diamond’ above the local minimum is compared with the distribution of z-scores that are min_depth downstream using the Wilcoxon rank-sum test. Similarly, the distribution of z-scores is computed with the z-scores min_depth upstream of the local minimum. The smallest of the two p-values is assigned to the local minimum.

If min_depth is not given, this is computed as bin size * 30 (if the bins are smaller than 1000), as bin size * 10 if the bins are between 1000 and 20.000 and as bin size * 5 if the bin size is bigger than 20.000.

If max_depth is not given, this is computed as bin size * 60 (if the bins are smaller than 1000), as bin size * 40 if the bins are between 1000 and 20.000 and as bin size * 10 if the bin size is bigger than 20.000.

hicMergeDomains

hicMergeDomains takes as input multiple TAD domain files from hicFindTads. It merges TADs from different resolutions to one TAD domains file, considers protein peaks from known TAD binding sites and computes a dependency graph of the TADs.

Two TADs are considered as one if they don’t overlap at x bins given by –value; TAD borders need to match the protein peaks given by –proteinFile; a relation between two TADs is given by their overlap of area in percent, parameter –percent. The protein peaks are only considered if in one bin at least –minPeak.

An example usage is:

$ hicMergeDomains –domainFiles 10kbtad_domains.bed 50kbtad_domains.bed –proteinFile ctcf_sorted.bed –outputMergedList two_files_ctcf –outputRelationList two_files_relation_ctcf –outputTreePlotPrefix two_files_plot_ctcf –outputTreePlotFormat pdf

usage: hicMergeDomains --domainFiles DOMAINFILES [DOMAINFILES ...]
                       [--proteinFile PROTEINFILE]
                       [--minimumNumberOfPeaks MINIMUMNUMBEROFPEAKS]
                       [--value VALUE] [--percent PERCENT]
                       [--outputMergedList OUTPUTMERGEDLIST]
                       [--outputRelationList OUTPUTRELATIONLIST]
                       [--outputTreePlotPrefix OUTPUTTREEPLOTPREFIX]
                       [--outputTreePlotFormat OUTPUTTREEPLOTFORMAT] [--help]
                       [--version]
Required arguments
--domainFiles, -d

The domain files of the different resolutions is required

Optional arguments
--proteinFile, -p

In order to be able to better assess the relationship between TADs, the associated protein file (e.g. CTCF for mammals) can be included. The protein file is required in broadpeak format

--minimumNumberOfPeaks, -m

Optional parameter to adjust the number of protein peaks when adapting the resolution to the domain files. At least minimumNumberOfPeaks of unique peaks must be in a bin to considered. Otherwise the bin is treated like it has no peaks.

Default: 1

--value, -v

Determine a value by how much the boundaries of two TADs must at least differ to consider them as two separate TADs.

Default: 5000

--percent, -pe

For the relationship determination, a percentage is required from which area coverage the TADs are related to each other.For example, a relationship should be entered from 5 percent area coverage -p 0.05

Default: 0.5

--outputMergedList, -om

File name for the merged domains list

Default: “mergedDomains.bed”

--outputRelationList, -or

File name for the relationship list of the TADs.

Default: “relationList.txt”

--outputTreePlotPrefix, -ot

File name prefix for the relationship tree of the TADs

Default: “relationship_tree_

--outputTreePlotFormat, -of

File format of the relationship tree. Supported formats are listed on: https://www.graphviz.org/doc/info/output.html

Default: “pdf”

--version

show program’s version number and exit

This tool is the result of the bachelor thesis (in German) from Sarah Domogalla at the Bioinformatics lab of the Albert-Ludwigs-University Freiburg. The thesis was written in winter semester 2019/2020.

The thesis covers the issue of differently detected TADs of the same data but on different resolutions.

_images/tads_different_resolution.png

As it can be seen here, the detected TADs on the four resolutions are quite different, and have maybe a hierarchy. Compare for example the region around 53 Mb in the different resolutions, it is unclear if these are multiple small TADs as suggest by the 10kb resolution, mid-size one as seen by 50 kb or one big, as detected on 100 kb resolution. Consider the next image for an abstract representation of this problem.

_images/tad_hier.png

The abstract representation shows multiple issues: Are ID_85,ID_86 and ID_87 related in any hierarchy? Which role plays ID_75? hicMergeDomains tries to solve this by providing the parameters –value and –percentage to control how the relationship is determined and if TADs are duplicated and therefore deleted.

_images/tads_hier_abstract.png

This abstract representation of the hierarchical TADs is created for each chromosome and provides insights for the organization of the chromatin structure in the three dimensional space.

_images/tads_all_clean.png

The last image shows a cleaned TAD detection based on all four inputs.

hicDifferentialTAD

Computes differential TADs by comparing the precomputed TAD regions of the target matrix with the same regions of the control matrix. Please notice that the matrices need to have the same read coverage, this can be achieved with hicNormalize and the ‘smallest’-mode. H0 is the assumption that two regions are identical, the rejected files contain therefore the as differential considered regions.

usage: hicDifferentialTAD [--targetMatrix TARGETMATRIX]
                          [--controlMatrix CONTROLMATRIX]
                          [--tadDomains TADDOMAINS]
                          [--outFileNamePrefix OUTFILENAMEPREFIX]
                          [--pValue PVALUE]
                          [--mode {intra-TAD,left-inter-TAD,right-inter-TAD,all}]
                          [--modeReject {all,one}] [--threads THREADS]
                          [--help] [--version]
Required arguments
--targetMatrix, -tm

The matrix which was used to compute the TADs

--controlMatrix, -cm

The control matrix to test the TADs for a differential interaction pattern.

--tadDomains, -td

The TADs domain file computed by hicFindTADs.

--outFileNamePrefix, -o

Outfile name prefix to store the accepted / rejected H0 TADs.

Optional arguments
--pValue, -p

H0 is considered as ‘two regions are identical.’ i.e. all regions with a test result of <= p-value are rejected and considered as differential.

Default: 0.05

--mode, -m

Possible choices: intra-TAD, left-inter-TAD, right-inter-TAD, all

Consider only intra-TAD interactions, or additional left inter-TAD, right inter-TAD or all.

Default: “all”

--modeReject, -mr

Possible choices: all, one

All test of a mode must be rejected (all) or reject region (and accept it is differential) as soon as at least one region is having a p-value <= –pValue.

Default: “one”

--threads, -t

Number of threads to use, the parallelization is implemented per chromosome.

Default: 4

--version

show program’s version number and exit

hicDifferentialTAD computes with a treatment Hi-C file, a control Hi-C file and a precomputed TAD domains file if the detected TADs are differential between the treatment and the control sample. The TADs need to be precomputed on the treatment file with _hicFindTADs.

hicDifferentialTAD extract per TAD three regions: the intra-TAD, the left and right inter-TAD region. In the following image, the upper visualization shows a region with the detected TADs which are indicated by the black lines. The bottom shows as an example which regions are used for the differential test: the intra-TAD region is highlighted in red, the left inter-TAD in sandy-color and the right inter-TAD in blue. Between two samples a Wilcoxon rank-sum test is applied for a TAD under H0: ‘The regions are equal’. For all three regions of a TAD the rank-sum test is independently applied. The user has the choice with the two parameters ‘mode’ and ‘modeReject’ to define if a) all three regions should be considered (‘all’), or only a subset e.g. the intra-TAD or intra-TAD and left inter-TAD should be considered; and b) if all regions need to have lower p-value than the user given to reject H0 or if it is enough that at least one region is rejecting H0 to consider the region as differential.

_images/hicDifferentialTAD_example1.png
Example usage
$ hicDifferentialTAD -tm GSM2644945_Untreated-R1.100000_chr1.cool -cm GSM2644947_Auxin2days-R1.100000_chr1.cool -td untreated_R1_domains.bed -o differential -p 0.01 -t 4 -mr all

In this example data from Nora et al. “Targeted Degradation of CTCF Decouples Local Insulation of Chromosome Domains from Genomic Compartmentalization”, Cell 169.5 (2017): 930-944 is used [GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98671]

_images/hicDifferentialTAD.png

The output are two BED-similar files: the ‘_accepted.diff_tad’ and ‘_rejected.diff_tad’ file. The difference to a real BED file is a) the usage of a header starting with ‘#’, the first six columns are BED6 standard, however, there are three additional columns with the p-values of each intra-TAD, left-inter-TAD and right-inter-TAD test. The score value and name is copied from the domains.bed file which is a output of hicFindTADs.

# Created with HiCExplorer's hicDifferentialTAD version 3.5-dev
# H0 'regions are equal' H0 is rejected for all p-value smaller or equal the user given p-value threshold; i.e. regions in this file are considered as differential.
# Rejected regions with Wilcoxon rank-sum test to p-value: 0.01  with used mode: all and modeReject: all
# Chromosome        start   end     name    score   strand  p-value left-inter-TAD  p-value right-inter-TAD p-value intra-TAD
chr1        4400000 6200000 ID_0.01_1       -0.5630275      .       nan     0.42772134044376386     0.0001942482956610518
chr1        6200000 7300000 ID_0.01_2       -0.235798       .       0.3479058102348157      0.651801360704674       0.011174626122333891
chr1        7300000 9500000 ID_0.01_3       -0.44334        .       0.7619708385959597      0.9408966423531526      4.668547072331386e-06
chr1        10100000        11600000        ID_0.01_5       -0.6844015      .       0.10355136240139871     0.2022260523997077      0.006197912102328514
chr1        11600000        12900000        ID_0.01_6       -0.607587       .       0.39564035674322084     0.04934747494654432     5.069275897797787e-07
chr1        12900000        13700000        ID_0.01_7       -0.6303795      .       0.030328896271663634    0.048097680638217614    0.0023999817017426378
chr1        13700000        14900000        ID_0.01_8       -1.2063985      .       0.07605786424686575     0.5322636852613494      1.8146100883843892e-06
chr1        14900000        17100000        ID_0.01_9       -0.3406565      .       0.4982863585572014      0.692745144471043       0.002562121461829293
chr1        17100000        18100000        ID_0.01_10      -0.230354       .       0.8237750360973404      0.011911117420279341    0.00020602838341698515
chr1        18100000        19100000        ID_0.01_11      -0.5135365      .       0.007089417650089524    0.07553560212368073     0.03561337274424189
hicMergeTADbins

Uses a BED file of domains or TAD boundaries to merge the bin counts of a Hi-C matrix per TAD.

The output matrix contains the total counts per TAD and the total contacts with all other TADs.

usage: hicMergeTADbins [-h] --matrix MATRIX --domains DOMAINS --outFile
                       OUTFILE [--version]
Named Arguments
--matrix, -m

Path to Hi-C matrix to use.

--domains

Path to a bed file containing the domains.

--outFile, -o

Name for the resulting matrix file.

--version

show program’s version number and exit

Tools for Hi-C and TADs visualization

hicPlotMatrix

Creates a heatmap of a Hi-C matrix.

usage: hicPlotMatrix --matrix MATRIX --outFileName OUTFILENAME [--title TITLE]
                     [--scoreName SCORENAME] [--perChromosome]
                     [--clearMaskedBins]
                     [--chromosomeOrder CHROMOSOMEORDER [CHROMOSOMEORDER ...]]
                     [--region REGION] [--region2 REGION2] [--log1p] [--log]
                     [--colorMap COLORMAP] [--vMin VMIN] [--vMax VMAX]
                     [--dpi DPI] [--bigwig BIGWIG [BIGWIG ...]]
                     [--bigwigAdditionalVerticalAxis]
                     [--vMinBigwig VMINBIGWIG] [--vMaxBigwig VMAXBIGWIG]
                     [--flipBigwigSign]
                     [--scaleFactorBigwig SCALEFACTORBIGWIG]
                     [--fontsize FONTSIZE] [--rotationX ROTATIONX]
                     [--rotationY ROTATIONY]
                     [--increaseFigureWidth INCREASEFIGUREWIDTH]
                     [--increaseFigureHeight INCREASEFIGUREHEIGHT]
                     [--loops LOOPS] [--help] [--version]
Required arguments
--matrix, -m

Path of the Hi-C matrix to plot.

--outFileName, -out

File name to save the image.

Optional arguments
--title, -t

Plot title.

--scoreName, -s

Score name label for the heatmap legend.

--perChromosome

Instead of plotting the whole matrix, each chromosome is plotted next to the other. This parameter is not compatible with –region.

Default: False

--clearMaskedBins

If set, masked bins are removed from the matrix and the nearest bins are extended to cover the empty space instead of plotting black lines.

Default: False

--chromosomeOrder

Chromosomes and order in which the chromosomes should be plotted. This option overrides –region and –region2.

--region

Plot only this region. The format is chr:start-end The plotted region contains the main diagonal and is symmetric unless –region2 is given.

--region2

If given, then only the region defined by –region and –region2 is given. The format is the same as –region1.

--log1p

Plot the log1p of the matrix values.

Default: False

--log

Plot the MINUS log of the matrix values.

Default: False

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html

Default: “RdYlBu_r”

--vMin

Minimum score value.

--vMax

Maximum score value.

--dpi

Resolution for the image in case theoutput is a raster graphics image (e.g png, jpg).

Default: 72

--bigwig

Bigwig file to plot below the matrix. This can for example be used to visualize A/B compartments or ChIP-seq data.

--bigwigAdditionalVerticalAxis

Add an additional axis to determine the values of a bigwig file in 2D better.

Default: False

--vMinBigwig

Minimum score value for bigwig.

--vMaxBigwig

Maximum score value for bigwig

--flipBigwigSign

The sign of the bigwig values are flipped. Useful if hicPCA gives inverted values.

Default: False

--scaleFactorBigwig

Scale the values of a bigwig file by the given factor.

Default: 1.0

--fontsize

Fontsize in the plot for x and y axis.

Default: 10

--rotationX

Rotation in degrees for the labels of x axis.

Default: 0

--rotationY

Rotation in degrees for the labels of y axis.

Default: 0

--increaseFigureWidth

Plotting additional bigwig tracks can cause compression in x or y direction of the heatmap. Set this factor to a value unequal to 0 to correct this.

Default: 0.5

--increaseFigureHeight

Plotting additional bigwig tracks can cause compression in x or y direction of the heatmap. Set this factor to a value unequal to 0 to correct this.

Default: 0.5

--loops

Bedgraph file to plot detected long range contacts from hicDetectLoops.

--version

show program’s version number and exit

Details

hicplotMatrix takes a Hi-C matrix and plots the interactions of all or some chromosomes.

Examples

Here’s an example of Hi-C data from wild-type D. melanogaster embryos.

_images/hicPlotMatrix.png

This plot shows all contacts of a Hi-C matrix. Its bins were merged into 25 kb bins using hicMergeMatrixBins. Alternatively, chromosomes can be plotted separately:

_images/hicPlotMatrix_perChromosome.png
$ hicPlotMatrix -m Dmel.h5 -o hicPlotMatrix.png \
-t 'D.melanogaster (--perChromosome)' --log1p \
--clearMaskedBins --chromosomeOrder 2L 2R 3L 3R X --perChromosome
hicPlotTADs
Description

For parameter options please see the pyGenomeTracks documentation.

Usage example

The hicPlotTADs output is similar to a genome browser screenshot that besides the usual genes and score data (like bigwig or bedgraph files) also contains Hi-C data. The plot is composed of tracks that need to be specified in a configuration file. Once the track file is ready, hicPlotTADs can be used as follows:

$ hicPlotTADs --tracks tracks.ini --region chrX:99,974,316-101,359,967 \
-t 'Marks et. al. TADs on X' -o tads.pdf
_images/marks_et-al_TADs1.png
Configuration file template

The following is a template for the configuration file which is based on .ini configuration files. Each track is defined by a section header (for example [hic track]), followed by parameters specific to the section as color, title, etc. For details please see the documentation of pyGenomeTracks.

$ hicPlotTADs --tracks hic_track.ini -o hic_track.png --region chrX:2500000-3500000
[x-axis]
where = top

[hic matrix]
file = hic_data.h5
title = Hi-C data
# depth is the maximum distance plotted in bp. In Hi-C tracks
# the height of the track is calculated based on the depth such
# that the matrix does not look deformed
depth = 300000
transform = log1p
file_type = hic_matrix

[tads]
file = domains.bed
file_type = domains
border_color = black
color = none
# the tads are overlay over the hic-matrix
# the share-y options sets the y-axis to be shared
# between the Hi-C matrix and the TADs.
overlay_previous = share-y

[spacer]

[bigwig file test]
file = bigwig.bw
# height of the track in cm (optional value)
height = 4
title = ChIP-seq
min_value = 0
max_value = 30
_images/hic_track.png
hicPlotViewpoint

Plots the number of interactions around a given reference point in a region.

usage: hicPlotViewpoint --matrix MATRIX [MATRIX ...] --region REGION
                        --outFileName OUTFILENAME --referencePoint
                        REFERENCEPOINT [--chromosome CHROMOSOME]
                        [--interactionOutFileName INTERACTIONOUTFILENAME]
                        [--dpi DPI] [--version] [--help]
Required arguments
--matrix, -m

Hi-C matrix to plot.

--region

The format is chr:start-end.

--outFileName, -o

File name of the image to save.

--referencePoint, -rp

Reference point. Needs to be in the format: ‘chr:100’ for a single reference point or ‘chr:100-200’ for a reference region.

Optional arguments
--chromosome, -C

Optional parameter: Only show results for this chromosome.

--interactionOutFileName, -i

Optional parameter: If set, a bedgraph file with all interaction will be created.

--dpi

Optional parameter: Resolution for the image in case theouput is a raster graphics image (e.g png, jpg).

Default: 300

--version

show program’s version number and exit

hicAggregateContacts
Background

hicAggregateContacts is a tool that allows plotting of aggregated Hi-C sub-matrices of a specified list of positions. Positions of interest can for example be binding sites of a specific protein that were determined by ChIP-seq or genetic elements as transcription start sites of active genes.

Description

Takes a list of positions in the Hi-C matrix and makes a pooled image.

usage: hicAggregateContacts --matrix MATRIX --outFileName OUTFILENAME --BED
                            BED --range RANGE [--BED2 BED2]
                            [--numberOfBins NUMBEROFBINS]
                            [--transform {total-counts,z-score,obs/exp,none}]
                            [--avgType {mean,median}] [--help] [--version]
                            [--outFilePrefixMatrix OUTFILEPREFIXMATRIX]
                            [--outFileContactPairs OUTFILECONTACTPAIRS]
                            [--diagnosticHeatmapFile DIAGNOSTICHEATMAPFILE]
                            [--row_wise] [--kmeans KMEANS] [--hclust HCLUST]
                            [--howToCluster {full,center,diagonal}]
                            [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                            [--colorMap COLORMAP] [--plotType {2d,3d}]
                            [--vMin VMIN] [--vMax VMAX] [--dpi DPI]
Required arguments
--matrix, -m

Path of the Hi-C matrix to plot.

--outFileName, -out

File name to save the image.

--BED

Interactions between regions in this BED file are plotted.

--range

Range of contacts that will be considered for plotting the aggregate contacts in bp with the format low_range:high_range for example 1000000:20000000. The range should start at contacts larger than TAD size to reduce background interactions.

Optional arguments
--BED2

Optional second BED file. Interactions between regions in first and second BED file are plotted.

--numberOfBins

Number of bins to include in the submatrix. The bed regions will be centered between half number of bins and the other half number of bins.

Default: 51

--transform

Possible choices: total-counts, z-score, obs/exp, none

Type of transformation for the matrix. The options are “none”, “total-counts”, “z-score” and “obs/exp”. If total counts are selected, the sub-matrix values are divided by the total counts for normalization. If z-score or obs/exp are selected, the Hi-C matrix is converted into a z-score or observed / expected matrix.

Default: “none”

--avgType

Possible choices: mean, median

Type of average used in the output matrix. Options are mean and median. Default is median.

Default: “median”

--version

show program’s version number and exit

--dpi

Optional parameter: Resolution for the image in case theoutput is a raster graphics image (e.g png, jpg).

Default: 300

Output options
--outFilePrefixMatrix

If this option is set, the values underlying the output matrix will be saved to tab-delimited tables (one per chromosome) using the indicated prefix, for example TSS_to_TSS_chrX.tab. If clustering is performed, the values are saved including the cluster_id in the file TSS_to_TSS_chrX_cluster_1.tab.

--outFileContactPairs

Output file prefix. If this option is set, the position of the contact positions are saved as (chrom1, start1, end1, chrom2, start2, end2) where chrom_n, start_n, end_n correspond to the pair of positions used to compute the submatrix. The data is saved per chromosome and per cluster separately (one file each). The format is {prefix}_{chrom}_{cluster_id}.tab. If no clusters were computed, then only one file per chromosome is produced.

--diagnosticHeatmapFile

If given, a heatmap (per chromosome) is saved. Each row in the heatmap contains thediagonal of each of the submatrices centered on the bed file. This file is useful to get an idea of the values that are used for the aggregate matrix and to determine the fraction of sub-matrices that are aggregated that may have an enrichment at the center.

--row_wise

If given,the insteractions between each row of the BED file and its corresponding row of the BED2 file are computed.

Default: False

Clustering options
--kmeans

Number of clusters to compute. When this option is set, the submatrices are split into clusters (per chromosome) using the k-means algorithm.

--hclust

Number of clusters to compute (per chromosome). When this option is set, the matrix is split into clusters using the hierarchical clustering algorithm, using “ward linkage”. –hclust could be very slow if you have >1000 submatrices per chromosome. In those cases, you might prefer –kmeans

--howToCluster

Possible choices: full, center, diagonal

Options are “full”, “center” and “diagonal”. The full clustering is the default and takes all values of each submatrix for clustering. “center”, takes only a square of length 3x3 from each submatrix and uses only this values for clustering. With the “diagonal” option the clustering is only carried out based on the submatrix diagonal (representing values at the same distance to each other.)

Default: “full”

Plotting options
--chromosomes, -C

List of chromosomes to plot.

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html

Default: “RdYlBu_r”

--plotType

Possible choices: 2d, 3d

Plot type.

Default: “2d”

--vMin

Minimum value of the plotted score.

--vMax

Maximum value of the plotted score.

Usage example

Below, you can find an example of an aggregate Hi-C matrix obtained from Drosophila melanogaster Hi-C data. The interactions are plotted at binding sites of a protein that were determined by ChIP-seq. We plot sub-matrices of 30 bins (1.5 kb bin size, 45 kb in total). The regions specified in the BED file will be centered between half number of bins and the other half number of bins.The considered range is 300-1000 kb. The range should be adjusted and only contain contacts larger than TAD size to reduce background interactions.

$ hicAggregateContacts --matrix Dmel.h5 --BED ChIP-seq-peaks.bed \
--outFileName Dmel_aggregate_Contacts --vMin 0.8 --vMax 2.2 \
--range 300000:1000000 --numberOfBins 30 --chromosomes X \
--avgType mean --transform obs/exp
_images/hicAggregateContacts.png

This example was calculated using mean interactions of an observed vs expected transformed Hi-C matrix. Additional options for the matrix transformation are total-counts or z-score. Aggregate contacts can be plotted in 2D or 3D.

hicPlotAverageRegions

hicPlotAverage regions plots the data computed by hicAverageRegions. It shows the summed up and averaged regions around all given reference points. This tool is useful to plot differences at certain reference points as for example TAD boundaries between samples.

usage: hicPlotAverageRegions --matrix MATRIX --outputFile OUTPUTFILE [--log1p]
                             [--log] [--colorMap COLORMAP] [--vMin VMIN]
                             [--vMax VMAX] [--dpi DPI] [--help] [--version]
Required arguments
--matrix, -m

The averaged regions file computed by hicAverageRegions (npz file).

--outputFile, -o

The averaged regions plot.

Optional arguments
--log1p

Plot log1p of the matrix values.

Default: False

--log

Plot log of the matrix values.

Default: False

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html

Default: “hot_r”

--vMin

Minimum score value.

--vMax

Maximum score value.

--dpi

Resolution of image ifouput is a raster graphics image (e.g png, jpg).

Default: 300

--version

show program’s version number and exit

See hicAverageRegions for an example usage.

Hi-C contact matrix handling

hicConvertFormat

Conversion of Hi-C matrices of different file formats. We support the conversion of hic to cool format via hic2cool, and homer, HicPro, h5 and cool format to h5, cool, homer or ginteractions format. Moreover, hicConvertFormat accepts multiple input files from one format with different resolutions and creates a mcool file. Each original file is stored under the path e.g. ::/resolutions/10000. A batch computation is possible, the number of input files and output files needs to match, all input files need to be of the same format type and all output files too. For input and output of cooler files special options are available, for all other formats they will be ignored.HiCPro file format needs an additional bed file as input.

usage: hicConvertFormat --matrices MATRICES [MATRICES ...] --outFileName
                        OUTFILENAME [OUTFILENAME ...] --inputFormat
                        {h5,cool,hic,homer,hicpro} --outputFormat
                        {cool,h5,homer,ginteractions,mcool}
                        [--correction_name CORRECTION_NAME]
                        [--correction_division] [--store_applied_correction]
                        [--chromosome CHROMOSOME] [--enforce_integer]
                        [--load_raw_values]
                        [--resolutions RESOLUTIONS [RESOLUTIONS ...]] [--help]
                        [--version]
                        [--bedFileHicpro BEDFILEHICPRO [BEDFILEHICPRO ...]]
Required arguments
--matrices, -m

input file(s). Could be one or many files.

--outFileName, -o

File name to save the exported matrix.

--inputFormat

Possible choices: h5, cool, hic, homer, hicpro

File format of the input matrix file. The following options are available: h5 (native HiCExplorer format based on hdf5 storage format), cool, hic, homer, hicpro.

--outputFormat

Possible choices: cool, h5, homer, ginteractions, mcool

Output format. The following options are available: h5 (native HiCExplorer format based on hdf5 storage format). cool and ginteractions.

Default: “cool”

Optional arguments
--correction_name

Name of the column which stores the correction factors. The information about the column names can be figured out with the tool hicInfo. Option only for cool input files.

Default: “weight”

--correction_division

If set, division is applied for correction. Default is a multiplication. Option only for cool input files.

Default: False

--store_applied_correction

Store the applied correction and do not set correction factors. Option only for cool input files.

Default: False

--chromosome

Load only one chromosome. Option only for cool input files.

--enforce_integer

Enforce datatype of counts to integer. Option only for cool input files.

Default: False

--load_raw_values

Load only ‘count’ data and do not apply a correction. Option only for cool input files.

Default: False

--resolutions, -r

List of resolutions that should be added.

--version

show program’s version number and exit

--bedFileHicpro, -bf

Bed file(s) of hicpro file format.

Background

To reproduce analyses and to compare and use different Hi-C analysis software, conversion between interaction matrix file formats is crucial. However, most Hi-C softwares are only supporting their own data format which makes the exchange difficult. HiCExplorer supports a range of interaction matrices, both for import and for export.

Import:
  • hic

  • cool

  • h5

  • homer

  • HicPro

Export:
  • cool

  • mcool

  • h5

  • homer

  • ginteractions

With HiCExplorer version 2.2 hicConvertFormat and hicAdjustMatrix replace hicExport from HiCExplorer 2.1 and older versions.

Usage example
hic2cool

HiCExplorer uses the library hic2cool to convert .hic interaction matrix files to the cool format. Usually .hic files have the three correction factors KR, VC or VC_SQRT; however these cannot be applied natively by HiCExplorer tools because HiCExplorer expects the correction values to be stored in the column weight. To work with corrected data the correction factors need to applied separately, see section cool to cool.

The following example will convert a hic file which contains the resolution of 1000 to a cool file with 10kb resolution. The desired resolution needs to be existing in the hic file. If no resolution parameter is defined, a mcool file with all available resolutions is created.

$ hicConvertFormat -m matrix.hic --inputFormat hic --outputFormat cool -o matrix.cool --resolutions 10000

It is only possible to convert from hic to cool format, no other formats are supported.

cool to cool

The cool file format is developed and maintained by the Mirny lab and allows to access interaction matrices in an easy to use data format.

The cool data format allows to use the following options:

  • correction_name: In case correction factors are not stored in ‘weight’ the correct column name can be defined using this parameter and the resulting matrix will store the values in ‘weight’.

  • correction_division: Correction factors can be applied by a multiplication or a division. The default behaviour is to use the multiplication, in case the correction factors are inverted, set this parameter.

  • store_applied_correction: Set this parameter if correction factors should be applied on the data and should be written back to colum ‘counts’ in the corrected form and not as raw. Default: not set.

  • chromosomes: Define a list of chromosomes which should be included in the output matrix. All chromosomes which are not defined are not part of the new matrix. This parameter can speed up the processing especially if only one chromosome is used.

  • enforce_integer: Raw interaction data is stored as integers, after the correction is applied the data is a float. Set a this parameter to enforce integer values in the new matrix.

  • load_raw_values: Set this parameter if the interaction data should not be loaded with the correction factors.

Example usage
$ hicConvertFormat -m matrix.cool --inputFormat cool --outputFormat cool -o matrix.cool --correction_name KR
Homer

Homer is a software for ‘motif discovery and next generation sequencing analysis’ and supports Hi-C. HiCExplorer is able to read and write the interaction matrix from Homer. Homer stores the interaction matrix in a simple text file as a dense matrix. To write large matrices in Homer format needs a lot of space and can take a few ours to days.

Example usage
$ hicConvertFormat -m matrix.homer --inputFormat homer --outputFormat cool -o matrix.cool
Hic-Pro

HiC-Pro file format needs an additional bed file as input:

Example usage
$ hicConvertFormat -m matrix.hicpro --bedFileHicpro hicpro.bed --inputFormat hicpro --outputFormat cool -o matrix.cool
Create a mcool file

With HiCExplorer it is possible to create a multiple cool (mcool) file. These mcool files can be used e.g. with HiGlass.

To create an mcool file, use as input either one matrix in one of the supported read formats and define the desired resolutions or define multiple input matrices. In the second case, the matrices should all have different resolutions.

Example usage

The resolutions need to be a multiple of the input matrix i.e. matrix with 10kb, 20kb and 30kb are possible but not 35kb.

$ hicConvertFormat -m matrix.cool --inputFormat cool --outputFormat mcool
   -o multi_matrix.mcool --resolutions 20000 40000 70000 120000 500000
$ hicConvertFormat -m matrix10kb.cool matrix20kb.cool matrix30kb.cool
    --inputFormat cool --outputFormat mcool -o multi_matrix.mcool

The mcool matrix contains the individual matrices as follows:

multi_matrix.mcool::/resolutions/10000
multi_matrix.mcool::/resolutions/40000
multi_matrix.mcool::/resolutions/70000
multi_matrix.mcool::/resolutions/120000
multi_matrix.mcool::/resolutions/500000
hicAdjustMatrix

This tool adjusts hic matrices by keeping, removing or masking a given list of regions or chromosmes.

usage: hicAdjustMatrix --matrix MATRIX --outFileName OUTFILENAME
                       [--chromosomes CHROMOSOMES [CHROMOSOMES ...] |
                       --regions REGIONS | --maskBadRegions MASKBADREGIONS]
                       [--action {keep,remove,mask}] [--help] [--version]
Named Arguments
--chromosomes, -c

List of chromosomes to keep/remove.

--regions, -r

BED file which stores a list of regions to keep/remove.

--maskBadRegions, -mbr

Bad regions are identified and masked.

Required arguments
--matrix, -m

The Hi-C matrix to adjust. HiCExplorer supports the following file formats: h5 (native HiCExplorer format) and cool.

--outFileName, -o

File name to save the adjusted matrix.

Optional arguments
--action, -a

Possible choices: keep, remove, mask

Keep, remove or mask the list of specified chromosomes/regions. keep/remove: These options keep/remove bins of matrix by deleting them. This may cause issue plotting the matrix if several parts of a single chromosome are going to be deleted. In that case, one may consider using the mask option.

Default: “keep”

--version

show program’s version number and exit

hicAdjustMatrix can mask, remove or keep defined regions from a BED file or given chromosomes. This can be useful to create a smaller Hi-C matrix of a single chromosome for e.g. testing purposes or to remove repetitive chromosome ends before calculating A/B compartments using hicPCA.

Example usages
$ hicAdjustMatrix -m matrix.cool --action keep --chromosomes chr1 -o matrix_chr1.cool
$ hicAdjustMatrix -m matrix.cool --action mask --regions mask_regions.bed -o matrix_masked.cool

mask_regions.bed

chr1    10  30
chr1    50  300

Capture Hi-C analysis

chicQualityControl

Computes the sparsity of each viewpoint to determine the quality. A viewpoint is considered to be of bad quality if it is too sparse i.e. if there are too many locations with no interactions recorded.

This script creates three output files: a plot with the sparsity distribution per matrix, a plot with the sparsity distribution as histograms and a filtered reference points file.

An example usage is:

$ chicQualityControl -m matrix1.h5 matrix2.h5 -rp referencePointsFile.bed –range 20000 40000 –sparsity 0.01 -o referencePointFile_QC_passed.bed

usage: chicQualityControl --matrices MATRICES [MATRICES ...] --referencePoints
                          REFERENCEPOINTS --sparsity SPARSITY
                          [--outFileName OUTFILENAME]
                          [--outFileNameHistogram OUTFILENAMEHISTOGRAM]
                          [--outFileNameSparsity OUTFILENAMESPARSITY]
                          [--threads THREADS] [--fixateRange FIXATERANGE]
                          [--dpi DPI] [--help] [--version]
Required arguments
--matrices, -m

The input matrices to apply the QC on.

--referencePoints, -rp

Bed file contains all reference points which are checked for a sufficient number of interactions.

--sparsity, -s

Viewpoints with a sparsity less than the value given are considered of bad quality. If multiple matrices are given, the viewpoint is removed as soon as it is of bad quality in at least one matrix.

Optional arguments
--outFileName, -o

The output file name of the passed reference points. Used as prefix for the plots as well.

Default: “new_referencepoints.bed”

--outFileNameHistogram, -oh

The output file for the histogram plot.

Default: “histogram.png”

--outFileNameSparsity, -os

The output file for the sparsity distribution plot.

Default: “sparsity.png”

--threads, -t

Number of threads.

Default: 4

--fixateRange, -fs

Fixate score of background model starting at distance x. E.g. all values greater than 500kb are set to the value of the 500kb bin.

Default: 500000

--dpi

Optional parameter: Resolution for the image if theoutput is a raster graphics image (e.g png, jpg)

Default: 300

--version

show program’s version number and exit

chicViewpointBackgroundModel

chicViewpointBackgroundModel computes a background model for all given samples with all reference points. For all relative distances to a reference point a negative binomial distribution is fitted. In addition, for each relative distance to a reference point the average value for this location is computed. Both background models are used, the first one for p-value and significance computation, the second one to filter out interactions with a smaller x-fold over the mean.

The background distributions are fixed at –fixateRange, i.e. all distances lower or higher than this value use the fixed background distribution.

An example usage is:

$ chicViewpointBackgroundModel –matrices matrix1.cool matrix2.cool matrix3.cool –referencePoints referencePointsFile.bed –range 20000 40000 –outFileName background_model.bed

usage: chicViewpointBackgroundModel --matrices MATRICES [MATRICES ...]
                                    --referencePoints REFERENCEPOINTS
                                    [--averageContactBin AVERAGECONTACTBIN]
                                    [--truncateZeros]
                                    [--outFileName OUTFILENAME]
                                    [--threads THREADS]
                                    [--fixateRange FIXATERANGE] [--help]
                                    [--version]
Required arguments
--matrices, -m

The input matrices (samples) to build the background model on.

--referencePoints, -rp

Bed file contains all reference points which should be used to build the background model.

Optional arguments
--averageContactBin

Average the contacts of n bins via a sliding window approach.

Default: 5

--truncateZeros, -tz

Truncates the zeros before the distributions are fitted. Use it in case you observe an over dispersion.

Default: False

--outFileName, -o

The name of the background model file

Default: “background_model.txt”

--threads, -t

Number of threads (uses the python multiprocessing module).

Default: 4

--fixateRange, -fs

Fixate score of backgroundmodel starting at distance x. E.g. all values greater 500kb are set to the value of the 500kb bin.

Default: 500000

--version

show program’s version number and exit

chicViewpoint

Computes per input matrix all viewpoints which are defined in the reference points file. All files are stored in the folder defined by –outputFolder, and the files are named by the name of the reference point, the sample name and the location of the reference point:

gene_matrix_name_chr_start_end.txt

If multiple reference points are used and the processing downstream should be automated via batch processing mode, please activate –writeFileNamesToFile. All the file names will be written to this file; in the case of multiple samples two consecutive lines are considered as treatment vs. control for the differential analysis.

An example usage is:

$ chicViewpoint –matrices matrix1.cool matrix2.cool matrix3.cool –referencePoints referencePointsFile.txt –range 20000 40000 –outputFolder interactionFilesFolder

An example usage for batch mode is:

$ chicViewpoint –matrices matrix1.cool matrix2.cool matrix3.cool –referencePoints referencePointsFile.txt –range 20000 40000 –outputFolder interactionFilesFolder –writeFileNamesToFile interactionFile.txt

usage: chicViewpoint --matrices MATRICES [MATRICES ...] --range RANGE RANGE
                     --referencePoints REFERENCEPOINTS --backgroundModelFile
                     BACKGROUNDMODELFILE [--threads THREADS]
                     [--averageContactBin AVERAGECONTACTBIN]
                     [--decimalPlaces DECIMALPLACES]
                     [--writeFileNamesToFile WRITEFILENAMESTOFILE]
                     [--fixateRange FIXATERANGE] [--allViewpointsList]
                     [--outputFolder OUTPUTFOLDER] [--help] [--version]
Required arguments
--matrices, -m

Path to the Hi-C matrices which store the captured Hi-C data per sample.

--range

Defines the region upstream and downstream of a reference point which should be considered in the analysis. Please remember to use the same fixate range setting as for the background model computation and that distances of the range larger than the fixate range use the background model of those.Format is –region upstream downstream

--referencePoints, -rp

Reference point file. Needs to be in the format: ‘chr 100’ for a single reference point or ‘chr 100 200’ for a reference region and with a single reference point per line

--backgroundModelFile, -bmf

path to the background file computed by chicViewpointBackgroundModel

Optional arguments
--threads, -t

Number of threads (uses the python multiprocessing module).

Default: 4

--averageContactBin

Average the contacts of n bins via a sliding window approach to smooth the values and be less sensitive for outliers.

Default: 5

--decimalPlaces

Decimal places for all output floating numbers in the viewpoint files.

Default: 12

--writeFileNamesToFile, -w

Set this parameter to have a file with all file names of the viewpoint files (useful only in batch processing mode).

--fixateRange, -fs

Fixate range of background model starting at distance x. E.g. all values greater 500kb are set to the value of the 500kb bin.

Default: 500000

--allViewpointsList, -avl

Writes a file where all viewpoints all samples are sorted by the viewpoints.

Default: False

--outputFolder, -o

This folder contains all created viewpoint files.

Default: “interactionFiles”

--version

show program’s version number and exit

chicSignificantInteractions

Per viewpoint the significant interactions are detected based on the background model. For each viewpoint file, an output file is created with all recorded significant interactions and a target file. The target file is especially useful in the batch mode context; for two consecutive listed control and treatment viewpoints it merges the significant interactions which can then be used to test for a differential interaction scheme.

chicSignificantInteractions supports two modes to detect significant interactions, either by an x-fold over the average background or a loose p-value. In both cases neighboring significant peaks are merged together and an additional p-value is computed based on the sum of interactions for this neighborhood. Only interactions with a higher p-value (as specified by the threshold –pValue) are accepted as a significant interaction.

An example usage is for single mode is:

$ chicSignificantInteractions –interactionFile interactionFilesFolder/Sox17_FL-E13-5_chr1_1000_2000.bed –referencePoints referencePointsFile.bed –range 20000 40000 –backgroundModelFile background_model.bed –loosePValue 0.5 –pValue 0.01

An example usage is for batch mode is:

$ chicViewpointBackgroundModel –matrices matrix1.cool matrix2.cool matrix3.cool –referencePoints referencePointsFile.bed –range 20000 40000 –outFileName background_model.bed

The main difference between single mode and batch mode is that in single mode the parameter –interactionFile is interpreted as a list of viewpoint files created with chicViewpoint, whereas in batch mode only one file is allowed, containing the file names of viewpoint files (one per line). This file is created by chicViewpoint and the parameter –writeFileNamesToFile. In batch mode, please remember to specify the folder (via –interactionFileFolder) where chicViewpoint wrote the files.

usage: chicSignificantInteractions --interactionFile INTERACTIONFILE
                                   [INTERACTIONFILE ...] --pValue PVALUE
                                   [--xFoldBackground XFOLDBACKGROUND | --loosePValue LOOSEPVALUE]
                                   --backgroundModelFile BACKGROUNDMODELFILE
                                   --range RANGE RANGE
                                   [--outFileNameSuffix OUTFILENAMESUFFIX]
                                   [--interactionFileFolder INTERACTIONFILEFOLDER]
                                   [--targetFolder TARGETFOLDER]
                                   [--outputFolder OUTPUTFOLDER]
                                   [--writeFileNamesToFile WRITEFILENAMESTOFILE]
                                   [--targetFileList TARGETFILELIST]
                                   [--batchMode] [--threads THREADS]
                                   [--truncateZeroPvalues]
                                   [--fixateRange FIXATERANGE]
                                   [--peakInteractionsThreshold PEAKINTERACTIONSTHRESHOLD]
                                   [--resolution RESOLUTION]
                                   [--computeSampleNumber COMPUTESAMPLENUMBER]
                                   [--multipleTesting {fdr,bonferroni,None}]
                                   [--thresholdMultipleTesting THRESHOLDMULTIPLETESTING]
                                   [--help] [--version]
Named Arguments
--xFoldBackground, -xf

Filter x-fold over background. Used to merge neighboring bins with a broader peak but less significant interactions to a single peak with high significance. Used only for pValue option.

--loosePValue, -lp

loose p-value threshold to filter target regions in a first round. Used to merge neighboring bins with a broader peak but less significant interactions to a single peak with high significance. Used only for pValue option.

Required arguments
--interactionFile, -if

path to the interaction files which should be used for aggregation of the statistics.

--pValue, -p

p-value threshold to filter target regions for inclusion in differential analysis.

--backgroundModelFile, -bmf

path to the background file which is necessary to compute the rbz-score

--range

Defines the region upstream and downstream of a reference point which should be included. Format is –region upstream downstream, e.g. –region 500000 500000 plots 500kb up- and 500kb downstream. This value should not exceed the range used in the other chic-tools.

Optional arguments
--outFileNameSuffix, -suffix

File name suffix to save the results; prefix is the input file name.

Default: “_significant_interactions.txt”

--interactionFileFolder, -iff

Folder where the interaction files are stored. Applies only for batch mode.

Default: “.”

--targetFolder, -tf

Folder where the target files are stored.

Default: “targetFolder”

--outputFolder, -o

Output folder of the significant interaction files.

Default: “significantFiles”

--writeFileNamesToFile, -w

Default: “significantFilesBatch.txt”

--targetFileList, -tl

The file to store the target file names.

Default: “targetList.txt”

--batchMode, -bm

Turn on batch mode. The given file for –interactionFile and or –targetFile contain a list of the to be processed files.

Default: False

--threads, -t

Number of threads (uses the python multiprocessing module).

Default: 4

--truncateZeroPvalues, -tzpv

Sets all p-values which are equal to zero to one. This has the effect that the associated positions are not part of the significance decision.

Default: False

--fixateRange, -fs

Fixate range of backgroundmodel starting at distance x. E.g. all values greater than 500kb are set to the value of the 500kb bin.

Default: 500000

--peakInteractionsThreshold, -pit

The minimum number of interactions a detected peak needs to have to be considered.

Default: 5

--resolution, -r

Resolution of the bin in genomic units. Values are set as number of bases, e.g. 1000 for a 1kb, 5000 for a 5kb or 10000 for a 10kb resolution.This value is used to merge neighboring bins.

Default: 1000

--computeSampleNumber, -csn

Number of samples to compute together. Applies only in batch mode.

Default: 2

--multipleTesting, -mt

Possible choices: fdr, bonferroni, None

Multiple testing correction per relative distance with Bonferroni or FDR.

Default: “None”

--thresholdMultipleTesting, -tmt

Threshold for Bonferroni / FDR. Either a float value for all or a file with one threshold per relative distance.

--version

show program’s version number and exit

chicAggregateStatistic

chicAggregateStatistic is a preprocessing tool for chicDifferentialTest. It takes two consecutive viewpoint files and one target file and creates one file containing all locations which should be tested for differential interactions. Either one target file for two consecutive viewpoint files or one target file for all viewpoints is accepted.

An example usage is:

$ chicAggregateStatistic –interactionFile viewpoint1.txt viewpoint2.txt –targetFile targets.txt –outFileNameSuffix aggregated.txt

which will create a single output file: viewpoint1_viewpoint2_aggregated.txt

A second mode is the batch processing mode. For this you need a file containing the names of the viewpoint files (generated by chicViewpoint via –writeFileNamesToFile), a folder which contains the files, a target list file containing the name of all target files and a folder which contains the target files (created by chicSignificantInteractions):

$ chicAggregateStatistic –interactionFile viewpoint_names.txt –targetFile target_names.txt –interactionFileFolder viewpointFilesFolder –targetFileFolder targetFolder –batchMode –threads 20 –outFileNameSuffix aggregated.bed

If the –targetFileFolder flag is not set in batch mode, it is assumed the –targetFile should be used for all viewpoints.

usage: chicAggregateStatistic --interactionFile INTERACTIONFILE
                              [INTERACTIONFILE ...]
                              [--targetFile TARGETFILE [TARGETFILE ...]]
                              [--outFileNameSuffix OUTFILENAMESUFFIX]
                              [--interactionFileFolder INTERACTIONFILEFOLDER]
                              [--targetFileFolder TARGETFILEFOLDER]
                              [--outputFolder OUTPUTFOLDER]
                              [--writeFileNamesToFile WRITEFILENAMESTOFILE]
                              [--batchMode] [--threads THREADS] [--help]
                              [--version]
Required arguments
--interactionFile, -if

path to the interaction files which should be used for aggregation of the statistics.

--targetFile, -tf

path to the target files which contains the target regions to prepare data for differential analysis.

Optional arguments
--outFileNameSuffix, -suffix

File name suffix to save the result.

Default: “_aggregate_target.txt”

--interactionFileFolder, -iff

Folder where the interaction files are stored. Applies only for batch mode.

Default: “.”

--targetFileFolder, -tff

Folder where the target files are stored. Applies only for batch mode.

--outputFolder, -o

Output folder containing the files.

Default: “aggregatedFiles”

--writeFileNamesToFile, -w

Default: “aggregatedFilesBatch.txt”

--batchMode, -bm

turns on batch mode. The files provided by –interactionFile and/or –targetFile contain a list of the files to be processed.

Default: False

--threads, -t

Number of threads (uses the python multiprocessing module).

Default: 4

--version

show program’s version number and exit

chicDifferentialTest

chicDifferentialTest tests if two locations under consideration of the reference point have a different interaction count. For this either Fisher’s test or the chi2 contingency test can be used. The files that are accepted for this test can be created with chicAggregateStatistic. H0 assumes the interactions are not different. Therefore the differential interaction counts are all where H0 was rejected.

An example usage is:

$ chicDifferentialTest –interactionFile viewpoint1_aggregated.txt viewpoint2_aggregated.txt –alpha 0.05 –statisticTest fisher –outputFolder differentialResults

and this will create three files: viewpoint1_viewpoint2_aggregated_H0_accepted.txt, viewpoint1_viewpoint2_aggregated_H0_rejected.txt, viewpoint1_viewpoint2_aggregated_results.txt

The first file contains all locations where H0 was accepted, the second file all locations where H0 was rejected and the third one all locations with the test result.

A second mode is the batch processing mode. For this you need a file containing the names of the aggregated files (generated by chicAggregateStatistic via –writeFileNamesToFile and the batch mode):

$ chicDifferentialTest –statisticTest fisher –alpha 0.05 –interactionFile aggregatedFilesBatch.txt –interactionFileFolder aggregatedFilesFolder –batchMode –threads 20 –outputFolder differentialResults

This will create, as in the non-batch mode, three files per aggregated file and writes the file name to the file given by –rejectedFileNamesToFile. This last file can be used to plot the differential interactions per viewpoint in batch mode, using chicPlotViewpoint.

usage: chicDifferentialTest --interactionFile INTERACTIONFILE
                            [INTERACTIONFILE ...] --alpha ALPHA
                            [--interactionFileFolder INTERACTIONFILEFOLDER]
                            [--outputFolder OUTPUTFOLDER]
                            [--statisticTest {fisher,chi2}] [--batchMode]
                            [--threads THREADS]
                            [--rejectedFileNamesToFile REJECTEDFILENAMESTOFILE]
                            [--help] [--version]
Required arguments
--interactionFile, -if

path to the interaction files which should be used for the differential test.

--alpha, -a

define a significance level (alpha) for accepting samples

Default: 0.05

Optional arguments
--interactionFileFolder, -iff

Folder where the interaction files are stored. Applies only for batch mode.

Default: “.”

--outputFolder, -o

Output folder of the files.

Default: “differentialResults”

--statisticTest

Possible choices: fisher, chi2

Type of test used: fisher’s exact test or chi2 contingency

Default: “fisher”

--batchMode, -bm

turn on batch mode. The given file for –interactionFile and or –targetFile contain a list of the to be processed files.

Default: False

--threads, -t

Number of threads (uses the python multiprocessing module)

Default: 4

--rejectedFileNamesToFile, -r

Writes the names of the rejected H0 (therefore containing the differential interactions) to file. Can be used for batch processing mode of chicPlotViewpoint.

Default: “rejected_H0.txt”

--version

show program’s version number and exit

chicPlotViewpoint

chicPlotViewpoint plots one or many viewpoints with the average background model and the computed p-value per sample. In addition, it can highlight differential interactions of two samples and/or significant regions.

An example usage is:

$ chicPlotViewpoint –interactionFile viewpoint1.txt viewpoint2.txt –range 500000 500000 –backgroundModelFile background_model.txt –pValue –outFileName viewpoint1_2.png –dpi 300

In batch mode the list of file names and the folders containing the files need to be given:

$ chicPlotViewpoint –interactionFile viewpoint_names.txt -interactionFileFolder viewpointFilesFolder –differentialTestResult rejected_H0.txt –differentialTestResultsFolder differentialFolder –range 500000 500000 –backgroundModelFile background_model.txt –pValue –outputFolder plotsFOlder –dpi 300 –threads 20

usage: chicPlotViewpoint --interactionFile INTERACTIONFILE
                         [INTERACTIONFILE ...] --range RANGE RANGE
                         [--backgroundModelFile BACKGROUNDMODELFILE]
                         [--interactionFileFolder INTERACTIONFILEFOLDER]
                         [--differentialTestResult DIFFERENTIALTESTRESULT [DIFFERENTIALTESTRESULT ...]]
                         [--significantInteractionFileFolder SIGNIFICANTINTERACTIONFILEFOLDER]
                         [--differentialTestResultsFolder DIFFERENTIALTESTRESULTSFOLDER]
                         [--significantInteractions SIGNIFICANTINTERACTIONS [SIGNIFICANTINTERACTIONS ...]]
                         [--plotSignificantInteractions]
                         [--outputFolder OUTPUTFOLDER]
                         [--outputFormat OUTPUTFORMAT] [--dpi DPI]
                         [--binResolution BINRESOLUTION]
                         [--colorMapPvalue COLORMAPPVALUE]
                         [--maxPValue MAXPVALUE] [--minPValue MINPVALUE]
                         [--pValue]
                         [--pValueSignificanceLevels PVALUESIGNIFICANCELEVELS [PVALUESIGNIFICANCELEVELS ...]]
                         [--xFold XFOLD] [--truncateZeroPvalues]
                         [--outFileName OUTFILENAME] [--batchMode]
                         [--plotSampleNumber PLOTSAMPLENUMBER]
                         [--colorList COLORLIST [COLORLIST ...]]
                         [--threads THREADS] [--help] [--version]
Required arguments
--interactionFile, -if

path to the interaction files which should be used for plotting

--range

Defines the region upstream and downstream of a reference point which should be included. Format is –region upstream downstream, e.g.: –region 500000 500000 plots 500kb up- and 500kb downstream. This value should not exceed the range used in the other chic-tools.

Default: [500000, 500000]

Optional arguments
--backgroundModelFile, -bmf

path to the background file which should be used for plotting

--interactionFileFolder, -iff

Folder where the interaction files are stored. Applies only for batch mode.

Default: “.”

--differentialTestResult, -dif

Path to the H0 rejected files to highlight the regions in the plot.

--significantInteractionFileFolder, -siff

Folder where the files with detected significant interactions are stored. Applies only for batch mode.

Default: “.”

--differentialTestResultsFolder, -diff

Folder where the H0 rejected files are stored. Applies only for batch mode.

Default: “.”

--significantInteractions, -si

Path to the files with detected significant interactions to highlight the regions in the plot.

--plotSignificantInteractions, -psi

Highlights the significant interactions in the plot itself. If not set, only the p-values are updated

Default: False

--outputFolder, -of

Output folder of the files.

Default: “.”

--outputFormat, -format

Output format of the plot.

Default: “png”

--dpi

Optional parameter: Resolution for the image, ifoutput is a raster graphics image (e.g png, jpg)

Default: 300

--binResolution, -r

Resolution of the bin in genomic units. Values are set as number of bases, e.g. 1000 for a 1kb, 5000 for a 5kb or 10000 for a 10kb resolution.

Default: 1000

--colorMapPvalue

Color map to use for the p-value. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html

Default: “RdYlBu”

--maxPValue, -map

Maximal value for p-value. Values above this threshold are set to this value.

Default: 0.1

--minPValue, -mp

Minimal value for p-value. Values below this threshold are set to this value.

Default: 0.0

--pValue, -p

Plot p-values as a colorbar

Default: False

--pValueSignificanceLevels, -psl

Highlight the p-values by the defined significance levels.

--xFold, -xf

Plot x-fold region for the mean background.

--truncateZeroPvalues, -tzpv

Sets all p-values which are equal to zero to one.

Default: False

--outFileName, -o

File name to save the image. Not used in batch mode.

--batchMode, -bm

The given file for –interactionFile and or –targetFile contain a list of the to be processed files.

Default: False

--plotSampleNumber, -psn

Number of samples per plot. Applies only in batch mode.

Default: 2

--colorList, -cl

Colorlist for the viewpoint lines.

Default: [‘g’, ‘b’, ‘c’, ‘m’, ‘y’, ‘k’]

--threads, -t

Number of threads (uses the python multiprocessing module).

Default: 4

--version

show program’s version number and exit

For single-cell Hi-C data analysis please use scHiCExplorer .

tool

type

input files

main output file(s)

application

hicFindRestSites

preprocessing

1 genome FASTA file

bed file with restriction site coordinates

Identifies the genomic locations of restriction sites

hicBuildMatrix

preprocessing

2 BAM/SAM files

hicMatrix object

Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads

hicCorrectMatrix

preprocessing

hicMatrix object

normalized hicMatrix object

Uses iterative correction or Knight-Ruiz to remove biases from a Hi-C matrix

hicMergeMatrixBins

preprocessing

hicMatrix object

hicMatrix object

Merges consecutive bins on a Hi-C matrix to reduce resolution

hicSumMatrices

preprocessing

2 or more hicMatrix objects

hicMatrix object

Adds Hi-C matrices of the same size

hicNormalize

preprocessing

multiple Hi-C matrices

multiple Hi-C matrices

Normalize data to 0 to 1 range or to smallest total read count

hicCorrelate

analysis

2 or more hicMatrix objects

a heatmap/scatter plot

Computes and visualizes the correlation of Hi-C matrices

hicFindTADs

analysis

hicMatrix object

bedGraph file (TAD score), a boundaries.bed file, a domains.bed file (TADs)

Identifies Topologically Associating Domains (TADs)

hicMergeDomains

analysis

multiple TAD domain files

tad domain file with merged tad locations multiple files with plotted TAD relations

Merges detect TADs locations of different resolutions; hierarchical relation between TADs as multiple plots

hicDifferentialTAD

analysis

two Hi-C matrices one TAD domain file

two diff_tad files: accepted H0 and rejected H0. Similar to BED6

Identifies differential Topologically Associating Domains (TADs) between two Hi-C matrices

hicPlotMatrix

visualization

hicMatrix object

a heatmap of Hi-C contacts

Plots a Hi-C matrix as a heatmap

hicPlotTADs

visualization

hicMatrix object, a config file

Hi-C contacts on a given region, along with other provided signal (bigWig) or regions (bed) file

Plots TADs as a track that can be combined with other tracks (genes, signal, interactions)

hicPlotDistVsCounts

visualization

hicMatrix object

log log plot of Hi-C contacts per distance

Quality control

hicConvertFormat

data integration

one/multiple Hi-C file formats

Hi-C matrices/outputs in several formats

Convert matrix to different formats

hicAdjustMatrix

data integration

one Hi-C file formats

Hi-C matrix

Removes, masks or keeps specified regions of a matrix

hicInfo

information

one or more hicMatrix objects

Screen info

Prints information about matrices, like size, maximum, minimum, bin size, etc.

hicPCA

analysis

one Hi-C matrix

bedgraph or bigwig file(s) for each eigenvector

Computes for A / B compartments the eigenvectors

hicTransform

analysis

one Hi-C matrix

Hi-C matrix

Computes a obs_exp matrix like Lieberman-Aiden (2009), a pearson correlation matrix and or a covariance matrix. These matrices can be used for plotting.

hicPlotViewpoint

visualization

one Hi-C matrix

A viewpoint plot

A plot with the interactions around a reference point or region.

hicQC

information

log files from hicBuildMatrix

A quality control report

Quality control of the created contact matrix.

hicQuickQC

information

2 BAM/SAM files

An estimated quality control report

Estimated quality report of the Hi-C data.

hicCompareMatrices

analysis

two Hi-C matrices

one Hi-C matrix

Applies diff, ratio or log2ratio on matrices to compare them.

hicAverageRegions

analysis

multiple Hi-C matrices

one npz object

Averages the given locations. Visualization with hicPlotAverageRegions

hicDetectLoops

analysis

one Hi-C matrices

bedgraph file with loop locations

Detects enriched regions. Visualization with hicPlotmatrix and –loop parameter.

hicValidateLocations

analysis

one loop, one protein peak file

bedgraph file with matched loop locations, one file with loop / protein statistics

Matches loop locations with protein peak positions

hicMergeLoops

analysis

multiple loop files

bedgraph file with merged loop locations

Merges detect loop locations of different resolutions

hicHyperoptDetectLoops

analysis

one Hi-C matrix, protein peaks

best parameter setting

Search for best parameter setting for hicDetectLoops

hicHyperoptDetectLoopsHiCCUPS

analysis

one Hi-C matrix, protein peaks

best parameter setting

Search for best parameter setting for Juicer’s HiCCUPS

hicCompartmentalization

visualization

one Hi-C interaction matrix one PCA bedgraph file

one image polarization plot

The global compartmentalization signal.

hicPlotAverageRegions

visualization

one npz file

one image

Visualization of hicAverageRegions.

hicPlotSVL

analysis

one / multiple Hi-C matrices

one image, p-values file, raw data file

Computes short/long range contacts; a box plot, a p-value and raw data file

hicMergeTADbins

preprocessing

one Hi-C matrix, one BED file

one Hi-C matrix

Uses a BED file of domains or TAD boundaries to merge the bin counts of a Hi-C matrix.

chicQualityControl

preprocessing

Hi-C matrices reference point BED file

two plots accepted reference point BED file

Checks for sparsity of viewpoints and removes them if too sparse.

chicViewpointBackgroundModel

preprocessing

Hi-C matrices reference point BED file

background model file

Creates a background model for all given samples and reference points.

chicViewpoint

preprocessing

Hi-C matrices background model file

viewpoint file(s)

Creates per sample per viewpoint one viewpoint file.

chicSignificantInteractions

chicAggregateStatistic

preprocessing analysis

viewpoint file(s) background model file

significant interaction file(s) target file(s)

Detects significant interactions per viewpoint based on the background and neighborhood merging via x-fold and loose p-values.

preprocessing

viewpoint files(s) target file (s)

aggregated file(s) for differential test

Aggregates for one viewpoint of two samples via a target file the locations to test for differential interactions.

chicDifferentialTest

analysis

aggregated file(s) of two samples

H0_accepted-, H0_rejected-files

Tests with chi2 or fisher for differential interactions of two samples.

chicPlotViewpoint

visualization

viewpoint file(s) differential expression file(s) significant interactions file(s)

one image per viewpoint

Visualization of a viewpoint.

General principles

A typical HiCExplorer command could look like this:

$ hicPlotMatrix -m myHiCmatrix.h5 \
-o myHiCmatrix.pdf \
--clearMaskedBins \
--region chrX:10,000,000-15,000,000 \
--vMin -4 --vMax 4 \

You can always see all available command-line options via –help:

$ hicPlotMatrix --help
  • Output format of plots should be indicated by the file ending, e.g. MyPlot.pdf will return a pdf file, MyPlot.png a png-file.

  • Most of the tools that produce plots can also output the underlying data - this can be useful in cases where you don’t like the HiCExplorer visualization, as you can then use the data matrices produced by deepTools with your favorite plotting tool, such as R.

  • The vast majority of command line options are also available in Galaxy (in a few cases with minor changes to their naming).

Example usage

Hi-C analysis of mouse ESCs using HiCExplorer

The following example shows how we can use HiCExplorer to analyze a published dataset. Here we are using a Hi-C dataset from Marks et. al. 2015, on mouse ESCs.

Protocol

The collection of the cells for Hi-C and the Hi-C sample preparation procedure was performed as previously described Lieberman-Aiden et al., with the slight modification that DpnII was used as restriction enzyme during initial digestion. Paired-end libraries were prepared according to Lieberman-Aiden et al. and sequenced on the NextSeq 500 platform using 2 × 75 bp sequencing.

Prepare for analysis
Download Raw fastq files

The fastq files can be downloaded from the EBI archive (or NCBI archive). We will store the files in the directory original_data.

mkdir original_data

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/007/SRR1956527/SRR1956527_1.fastq.gz -O original_data/SRR1956527_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/007/SRR1956527/SRR1956527_2.fastq.gz -O original_data/SRR1956527_2.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/008/SRR1956528/SRR1956528_1.fastq.gz -O original_data/SRR1956528_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/008/SRR1956528/SRR1956528_2.fastq.gz -O original_data/SRR1956528_2.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/009/SRR1956529/SRR1956529_1.fastq.gz -O original_data/SRR1956529_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/009/SRR1956529/SRR1956529_2.fastq.gz -O original_data/SRR1956529_2.fastq.gz
Create an index

We start with creating an index for our alignment software for the GRCm38/mm10 genome. As a source we use the mm10 genome from UCSC

mkdir genome_mm10
wget http://hgdownload-test.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz -O genome_mm10/chromFa.tar.gz
tar -xvzf genome_mm10/chromFa.tar.gz
cat genome_mm10/*.fa > genome_mm10/mm10.fa

We have the mm10 genome stored in one fasta file and can build the index. We tried it successfully with hisat2, bowtie2 and bwa. Run the mapping with one of them and do not mix them!

hisat2
hisat2-build -p 8 genome_mm10/mm10.fa hisat2/mm10_index

You can find more information about hisat

bowtie2
bowtie2-build genome_mm10/mm10.fa bowtie2/mm10_index --threads 8

You can find more information about bowtie

bwa
bwa index -p bwa/mm10_index genome_mm10/mm10.fa

You can find more information about bwa

Mapping the RAW files

Mates have to be mapped individually to avoid mapper specific heuristics designed for standard paired-end libraries.

It is important to have in mind for the different mappers:

  • for either bowtie2 or hisat2 use the –reorder parameter which tells bowtie2 or hisat2 to output the sam files in the exact same order as in the .fastq files.

  • use local mapping, in contrast to end-to-end. A fraction of Hi-C reads are chimeric and will not map end-to-end thus, local mapping is important to increase the number of mapped reads.

  • Tune the aligner parameters to penalize deletions and insertions. This is important to avoid aligned reads with gaps if they happen to be chimeric.

hisat2
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956527_1.fastq.gz --reorder | samtools view -Shb - > SRR1956527_1.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956527_2.fastq.gz --reorder | samtools view -Shb - > SRR1956527_2.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956528_1.fastq.gz --reorder | samtools view -Shb - > SRR1956528_1.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956528_2.fastq.gz --reorder | samtools view -Shb - > SRR1956528_2.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956529_1.fastq.gz --reorder | samtools view -Shb - > SRR1956529_1.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956529_2.fastq.gz --reorder | samtools view -Shb - > SRR1956529_2.bam
bowtie2
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956527_1.fastq.gz --reorder | samtools view -Shb - > SRR1956527_1.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956527_2.fastq.gz --reorder | samtools view -Shb - > SRR1956527_2.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956528_1.fastq.gz --reorder | samtools view -Shb - > SRR1956528_1.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956528_2.fastq.gz --reorder | samtools view -Shb - > SRR1956528_2.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956529_1.fastq.gz --reorder | samtools view -Shb - > SRR1956529_1.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956529_2.fastq.gz --reorder | samtools view -Shb - > SRR1956529_2.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956527_1.fastq.gz | samtools view -Shb - > SRR1956527_1.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956527_2.fastq.gz | samtools view -Shb - > SRR1956527_2.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956528_1.fastq.gz | samtools view -Shb - > SRR1956528_1.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956528_2.fastq.gz | samtools view -Shb - > SRR1956528_2.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956529_1.fastq.gz | samtools view -Shb - > SRR1956529_1.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956529_2.fastq.gz | samtools view -Shb - > SRR1956529_2.bam
Build, visualize and correct Hi-C matrix
Create a Hi-C matrix using the aligned files

In the following we will create three Hi-C matrices and merge them to one.

Build Hi-C matrix

hicBuildMatrix builds the matrix of read counts over the bins in the genome, considering the sites around the given restriction site. We need to provide:

  • the input BAM/SAM files: –samFiles SRR1956527_1.sam SRR1956527_2.sam

  • binsize: –binSize 1000

  • restriction sequence: –restrictionSequence GATC

  • dangling end sequence: –danglingSequence GATC

  • the restriction sequence cut site file: create it with hicFindRestSite, the restriction sequence ‘GATC’ and the above used mouse genome fasta file

  • the name of output bam file which contains the accepted alignments: –outBam SRR1956527_ref.bam

  • name of output matrix file: –outFileName hicMatrix/SRR1956527_10kb.h5

  • the folder for the quality report: –QCfolder hicMatrix/SRR1956527_QC

  • the number of to be used threads. Minimum value is 3: –threads 8

  • the buffer size for each thread buffering inputBufferSize lines of each input BAM/SAM file: –inputBufferSize 400000

To build the Hi-C matrices:

mkdir hicMatrix
hicBuildMatrix --samFiles SRR1956527_1.bam SRR1956527_2.bam --binSize 10000 --restrictionSequence GATC --danglingSequence GATC --restrictionCutFile cut_sites.bed --outBam SRR1956527_ref.bam --outFileName hicMatrix/SRR1956527_10kb.h5 --QCfolder hicMatrix/SRR1956527_10kb_QC --threads 8 --inputBufferSize 400000
hicBuildMatrix --samFiles SRR1956528_1.bam SRR1956528_2.bam --binSize 10000 --restrictionSequence GATC --danglingSequence GATC --restrictionCutFile cut_sites.bed --outBam SRR1956528_ref.bam --outFileName hicMatrix/SRR1956528_10kb.h5 --QCfolder hicMatrix/SRR1956528_10kb_QC --threads 8 --inputBufferSize 400000
hicBuildMatrix --samFiles SRR1956529_1.bam SRR1956529_2.bam --binSize 10000 --restrictionSequence GATC --danglingSequence GATC --restrictionCutFile cut_sites.bed --outBam SRR1956529_ref.bam --outFileName hicMatrix/SRR1956529_10kb.h5 --QCfolder hicMatrix/SRR1956529_10kb_QC --threads 8 --inputBufferSize 400000

The output bam files show that we have around 34M, 54M and 58M selected reads for SRR1956527, SRR1956528 & SRR1956529, respectively. Normally 25% of the total reads are selected. The output matrices have counts for the genomic regions. The extension of output matrix files is .h5.

A quality report is created in e.g. hicMatrix/SRR1956527_10kb_QC, have a look at the report hicQC.html.

The Hi-C quality report showing the results for 'pairs used & filtered'

A segment of Hi-C quality report.

Merge (sum) matrices from replicates

To increase the depth of reads we merge the counts from these three replicates.

hicSumMatrices --matrices hicMatrix/SRR1956527_10kb.h5 hicMatrix/SRR1956528_10kb.h5 \
        hicMatrix/SRR1956529_10kb.h5 --outFileName hicMatrix/replicateMerged_10kb.h5
Plot Hi-C matrix

A 10kb bin matrix is quite large to plot and is better to reduce the resolution (to know the size of a Hi-C matrix use the tool hicInfo), i.e. we usually run out of memory for a 1 kb or a 10 kb matrix and second, the time to plot is very long (minutes instead of seconds). For this we use the tool hicMergeMatrixBins.

Merge matrix bins for plotting

hicMergeMatrixBins merges the bins into larger bins of given number (specified by –numBins). We will merge 1000 bins in the original (uncorrected) matrix and then correct it. The new bin size is going to be 10.000 bp * 100 = 1.000.000 bp = 1 Mb

hicMergeMatrixBins \
--matrix hicMatrix/replicateMerged_10kb.h5 --numBins 100 \
--outFileName hicMatrix/replicateMerged.100bins.h5
Plot the corrected Hi-C matrix

hicPlotMatrix can plot the merged matrix. We use the following options:

  • the matrix to plot: –matrix hicMatrix/replicateMerged.100bins.h5

  • logarithmic values for plotting: –log1p

  • the resolution of the plot: –dpi 300

  • masked bins should not be plotted: –clearMaskedBins

  • the order of the chromosomes in the plot: –chromosomeOrder chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY

  • the color map: –colorMap jet

  • the title of the plot: –title “Hi-C matrix for mESC”

  • the plot image itself: –outFileName plots/plot_1Mb_matrix.png

mkdir plots
hicPlotMatrix \
--matrix hicMatrix/replicateMerged.100bins.h5 \
--log1p \
--dpi 300 \
--clearMaskedBins \
--chromosomeOrder chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY \
--colorMap jet \
--title "Hi-C matrix for mESC" \
--outFileName plots/plot_1Mb_matrix.png
corrected\_1Mb\_plot

The Hi-C interaction matrix with a resolution of 1 MB.

Correct Hi-C Matrix

hicCorrectMatrix corrects the matrix counts in an iterative manner. For correcting the matrix, it’s important to remove the unassembled scaffolds (e.g. NT_) and keep only chromosomes, as scaffolds create problems with matrix correction. Therefore we use the chromosome names (1-19, X, Y) here. Important: Use ‘chr1 chr2 chr3 etc.’ if your genome index uses chromosome names with the ‘chr’ prefix.

Matrix correction works in two steps: first a histogram containing the sum of contact per bin (row sum) is produced. This plot needs to be inspected to decide the best threshold for removing bins with lower number of reads. The second steps removes the low scoring bins and does the correction.

In the following we will use a matrix with a bin size of 20 kb: 10kb * 2 = 20 kb

hicMergeMatrixBins \
--matrix hicMatrix/replicateMerged_10kb.h5 --numBins 2 \
--outFileName hicMatrix/replicateMerged.matrix_20kb.h5

(1-19, X, Y) variant:

hicCorrectMatrix diagnostic_plot \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 --plotName hicMatrix/diagnostic_plot.png

(chr1-ch19, chrX, chrY) variant:

hicCorrectMatrix diagnostic_plot \
--chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 --plotName hicMatrix/diagnostic_plot.png
diagplot

Diagnostic plot for the Hi-C matrix at a resolution of 20 kb

The output of the program prints a threshold suggestion that is usually accurate but is better to revise the histogram plot. The threshold is visualized in the plot as a black vertical line. See Example usage for an example and for more info.

The threshold parameter needs two values:
  • low z-score

  • high z-score

“The absolute value of z represents the distance between the raw score and the population mean in units of the standard deviation. z is negative when the raw score is below the mean, positive when above.” (Source). For more information see wikipedia.

z-score definition: z = (x - my) / sigma

The z-score definition.

In our case the distribution describes the counts per bin of a genomic distance. To remove all bins with a z-score threshold less / more than X means to remove all bins which have less / more counts than X of mean of their specific distribution in units of the standard deviation.

Looking at the above distribution, we can select the value of -2 (lower end) and 3 (upper end) to remove. This is given by the –filterThreshold option in hicCorrectMatrix.

(1-19, X, Y) variant:

hicCorrectMatrix correct \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 \
--filterThreshold -2 3 --perchr --outFileName hicMatrix/replicateMerged.Corrected_20kb.h5

(chr1-ch19, chrX, chrY) variant:

hicCorrectMatrix correct \
--chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 \
--filterThreshold -2 3 --perchr --outFileName hicMatrix/replicateMerged.Corrected_20kb.h5

It can happen that the correction stops with:

`ERROR:iterative correction:*Error* matrix correction produced extremely large values.
This is often caused by bins of low counts. Use a more stringent filtering of bins.`

This can be solved by a more stringent z-score values for the filter threshold or by a look at the plotted matrix. In our case we see that chromosome Y is having more or less 0 counts in its bins. This chromosome can be excluded from the correction by not defining it for the set of chromosomes that should be corrected, parameter –chromosomes.

In the case of multiple samples / replicates that need to be normalized to the same read coverage we recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

Plot corrected matrix

We can now plot the one of the chromosomes (e.g. chromosome X) , with the corrected matrix.

New parameter:
  • The region to plot: –region chrX:10000000-2000000 or –region chrX

(1-19, X, Y) variant:

hicPlotMatrix \
--log1p --dpi 300 \
-matrix hicMatrix/replicateMerged.Corrected_20kb.npz \
--region X --title "Corrected Hi-C matrix for mESC : chrX" \
--outFileName plots/replicateMerged_Corrected-20kb_plot-chrX.png

(chr1-ch19, chrX, chrY) variant:

hicPlotMatrix \
--log1p --dpi 300 \
--matrix hicMatrix/replicateMerged.Corrected_20kb.npz \
--region chrX --title "Corrected Hi-C matrix for mESC : chrX" \
--outFileName plots/replicateMerged_Corrected-20kb_plot-chrX.png
correctMatrixPlot

The Hi-C interaction matrix for chromosome X.

Plot TADs

“The partitioning of chromosomes into topologically associating domains (TADs) is an emerging concept that is reshaping our understanding of gene regulation in the context of physical organization of the genome” [Ramirez et al. 2017].

Find TADs

TAD calling works in two steps: First HiCExplorer computes a TAD-separation score based on a z-score matrix for all bins. Then those bins having a local minimum of the TAD-separation score are evaluated with respect to the surrounding bins to decide assign a p-value. Then a cutoff is applied to select the bins more likely to be TAD boundaries.

hicFindTADs tries to identify sensible parameters but those can be change to identify more stringent set of boundaries.

mkdir TADs
hicFindTADs --matrix hicMatrix/replicateMerged.Corrected_20kb.h5 \
--minDepth 60000 --maxDepth 120000 --numberOfProcessors 8 --step 20000 \
--outPrefix TADs/marks_et-al_TADs_20kb-Bins  --minBoundaryDistance 80000 \
--correctForMultipleTesting fdr --threshold 0.05

As an output we get the boundaries, domains and scores separated files. We will use in the plot below only the TAD-score file.

Build Tracks File

We can plot the TADs for a given chromosomal region. For this we need to create a track file containing the instructions to build the plot. The hicPlotTADs documentation contains the instructions to build the track file.

In following plot we will use the listed track file. Please store it as track.ini.

[hic]
file = hicMatrix/replicateMerged.Corrected_20kb.h5
title = HiC mESC chrX:99974316-101359967
colormap = RdYlBu_r
depth = 2000000
height = 7
transform = log1p
file_type = hic_matrix

[tads]
file = TADs/marks_et-al_TADs_20kb-Bins_domains.bed
file_type = domains
border_color = black
color = none
line_width = 1.5
overlay_previous = share-y
show_data_range = no

[x-axis]
fontsize = 16
where = top

[tad score]
file = TADs/marks_et-al_TADs_20kb-Bins_score.bm
title = TAD separation score
height = 4
file_type = bedgraph_matrix

[spacer]

[gene track]
file = mm10_genes_sorted.bed
height = 10
title = mm10 genes
labels = false

We used as a gene track mm10 genes and sorted with sortBed from bedtools.

Plot

We plot the result with:

(1-19, X, Y) variant:

hicPlotTADs --tracks track.ini --region X:98000000-105000000 \
--dpi 300 --outFileName plots/marks_et-al_TADs.png \
--title "Marks et. al. TADs on X"

(chr1-ch19, chrX, chrY) variant:

hicPlotTADs --tracks track.ini --region chrX:98000000-105000000 \
--dpi 300 --outFileName plots/marks_et-al_TADs.png \
--title "Marks et. al. TADs on X"

The result is:

TADplot

TADplot

Importing and Exporting HiCExplorer data

Exporting HiCExplorer output to Bioconductor

It’s possible to export Hi-C Matrices produced by HiCExplorer to bioconductor in R, which allows us to use existing bioconductor infrastructure for differential Hi-C analysis. The tool hicConvertFormat allows us to write Hi-C matrices in a format that can easily be imported in bioconductor as GInteractions object. Below is an example.

## Assuming HiCExplorer is installed in ~/programs
hicConvertFormat --matrix ~/programs/HiCExplorer/test/test_data/Li_et_al_2015.h5 \ --inputFormat h5
-o GInteration_example --outputFormat GInteractions

The output file is in tsv format. It looks like this :

V1

V2

V3

V4

V5

V6

V7

X

19537

20701

X

19537

20701

1054.47483

X

19537

20701

X

20701

22321

375.86990

X

19537

20701

X

22321

24083

222.53900

X

19537

20701

X

24083

25983

114.26340

X

19537

20701

X

25983

27619

95.87463

This file can now be loaded into R as a GInteractions object, as shown below :

## INSIDE R
library(GenomicRanges)
library(InteractionSet)

hic <- read.delim("GInteraction_example.tsv", header = FALSE)

# Converting data.frame to GInteraction
convertToGI <- function(df){
            row.regions <- GRanges(df$V1, IRanges(df$V2,df$V3))# interaction start
            col.regions <- GRanges(df$V4, IRanges(df$V5,df$V6))# interaction end
            gi <- GInteractions(row.regions, col.regions)
            gi$norm.freq <- df$V7 # Interaction frequencies
            return(gi)
}
                        }
hic.gi <- convertToGI(hic)

Multiple files can be loaded, and converted to an InteractionSet object. If you have prepared matrices using binning, the intervals in the matrices must be the same. Therefore it’s easy to merge these matrices together in an InteractionSet object. In case some bins don’t match, we can merge the GInteraction objects based on matching bins, as follows.

# assuming hic.gi is a list of two GInteration objects hic.gi1 and hic.gi2
# hic.gi <- list(hic.gi1, hic.gi2)

# Get common regions between the two objects
combined <- unique(c(hic.gi$hic.gi1, hic.gi$hic.gi2))

# replace original regions with the common regions
replaceRegions(hic.gi$hic.gi1) <- regions(combined)
replaceRegions(hic.gi$hic.gi2) <- regions(combined)

# Get the matching indexes between the two objects
matched <- lapply(hic.gi, function(x) {
            match(x, combined)
            })

# Create a count matrix (for interaction frequencies)
counts <- matrix(0, ncol = 2, nrow=length(combined)) # counts for unmatched bins set to zero

# fill in the counts for matched bins
counts[matched$hic.gi1,1] <- hic.gi$hic.gi1$norm.freq
counts[matched$hic.gi2,2] <- hic.gi$hic.gi2$norm.freq

# Finally, create the InteractionSet object
iset <- InteractionSet(counts, combined)

InteractionSet objects can be used for packages like diffHic, for differential Hi-C analysis.

  • For more information on working with GInteraction and InteractionSet objects in bioconductor check out this vignette.

Captured Hi-C data analysis

How we use HiCExplorer to analyse cHi-C data

This How-to is based on the published dataset from Andrey et al. 2017. For the tutorial, we use the samples FL-E13.5 and MB-E-10.5.

Download the raw data

Please download the raw data via the following links or via NCBI GSE84795 .

Dataset

forward

reverse

CC-FL-E135-Wt-Mm-Rep1

SRR3950565_1

SRR3950565_2

CC-FL-E135-Wt-Mm-Rep2

SRR3950566_1

SRR3950566_2

CC-MB-E105-Wt-Mm-Rep1

SRR3950559_1

SRR3950559_2

CC-MB-E105-Wt-Mm-Rep2

SRR3950560_1

SRR3950560_2

Mapping

Map the files with a mapper of your choice against the mm9 reference genome; as an example, the mapping with bowtie2 is shown.

bowtie2 -x mm9_index --threads 8 -U SRR3950565_1.fastq.gz --reorder | samtools view -Shb - > SRR3950565_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950565_2.fastq.gz --reorder | samtools view -Shb - > SRR3950565_2.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950566_1.fastq.gz --reorder | samtools view -Shb - > SRR3950566_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950566_2.fastq.gz --reorder | samtools view -Shb - > SRR3950566_2.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950559_1.fastq.gz --reorder | samtools view -Shb - > SRR3950559_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950559_2.fastq.gz --reorder | samtools view -Shb - > SRR3950559_2.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950560_1.fastq.gz --reorder | samtools view -Shb - > SRR3950560_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950560_2.fastq.gz --reorder | samtools view -Shb - > SRR3950560_2.bam
Create cHi-C matrices

To create a cHi-C matrix we use HiCExplorer’s hicBuildMatrix for each replicate separately and merge the replicates into a single matrix later. Like Andrey et al., we use a resolution of 1kb and use the restriction enzyme DpnII.

hicBuildMatrix --samFiles SRR3950565_1.bam SRR3950565_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950565.cool --QCfolder SRR3950565_QC --threads 6
hicBuildMatrix --samFiles SRR3950566_1.bam SRR3950566_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950566.cool --QCfolder SRR3950566_QC --threads 6
hicBuildMatrix --samFiles SRR3950559_1.bam SRR3950559_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950559.cool --QCfolder SRR3950559_QC --threads 6
hicBuildMatrix --samFiles SRR3950560_1.bam SRR3950560_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950560.cool --QCfolder SRR3950560_QC --threads 6
hicSumMatrix --matrices SRR3950565.cool SRR3950566.cool --outFileName FL-E13-5.cool
hicSumMatrix --matrices SRR3950559.cool SRR3950560.cool --outFileName MB-E10-5.cool
Terminology: Reference point vs viewpoint

A reference point is one single genomic position i.e. chr1 500 510 is a reference point. A viewpoint is in contrast the region defined by the reference point and the up and downstream range, i.e. range 100 and reference point chr1 50 70 leads to the viewpoint chr1 400 610.

Creation of reference point file

Andrey et al. state that they used a total of 460 reference points, but that 24 were removed due to low sequence coverage or non-correspondence to a promoter region, leading to 446 in total.

To reproduce this, we need all reference points which are published in Supplementary Table S2 and S8.

It is simplest to create the reference point file in the following format using Excel and store it as a tab separated file:

chr1        4487435 4487435 Sox17

Otherwise, just download the prepared file. We will do the quality control on our own and compare with the results of Andrey et al.

Quality control

As a first step we compute the quality of each viewpoint by considering the sparsity. As soon as one viewpoint in one sample is less than the user-defined threshold (–sparsity), the reference point is no longer considered.

chicQualityControl -m FL-E13-5.cool MB-E10-5.cool -rp reference_points.bed --sparsity 0.025 --threads 20

The quality control creates five files: two plots showing the sparsity structure of the samples and three files containing the accepted reference points, the rejected ones and one file with all viewpoints and their sparsity level per sample.

In our example the plots look like the following:

_images/sparsity.png _images/histogram.png

The first plot shows the sparsity per sample for each viewpoint, while the second one shows the sparsity distribution as a histogram. It can be seen quite clearly that only a minority of the samples are really sparse and therefore need to be removed. The red line indicates the chosen sparsity level.

The reference point Tdap2b at chr1 19198995, which has a sparsity of 0.018 in FL-E13-5 and 0.016 in MB-E10-5, is considered to be of bad quality. To confirm this result we plot the viewpoint:

_images/Tfap2b_FL-E13-5_MB-E10-5_chr1_19198995_19198995.png

The plot shows there are effectively no interactions except with the reference point itself and confirm the point should be removed from the data.

The result of the quality control rejected 71 reference points as too sparse, but surprisingly the viewpoints rejected by Andrey et al. are accepted. An explanation for this could be that we only consider two samples and not all samples used by Andrey, and therefore we missed the bad quality of some viewpoints.

Please consider that this bad viewpoint was selected arbitrary out of the sample data and is only an example.

Download the data: Filtered reference points, Quality control raw data and rejected reference points.

Background model

The background model computes all viewpoints given by the reference points for both samples in a range defined by the parameter fixateRange. We recommend setting it to 500kb because real interactions above the range are rarely observed and very low interaction numbers such as 1 are already considered to be significant. With this setting, only the interactions in a range 500kb up- and downstream of the reference point are considered for each viewpoint. Based on this data, two background models are computed; the first one computes the average per relative distance to the reference point, and secondly, a negative binomial distribution per relative distance to the reference point is fitted. This first model is used for filtering in the significant interaction evaluation by an x-fold factor and for plotting. The negative binomial model is more important; it is used to compute a p-value per relative distance in each sample, which is used to make the decision if an interaction is considered as significant.

chicViewpointBackgroundModel -m FL-E13-5.cool MB-E10-5.cool --fixateRange 500000 -t 20 -rp reference_points.bed -o background_model.txt

The background model looks like this:

Relative position   size nbinom     prob nbinom     max value       mean value
-500000             75.895607451213 0.998528939430  2.333333333333  0.000101543771
-499000             90.348171762247 0.998725799952  2.750000000000  0.000104681360
-498000             78.512621775755 0.998514111424  2.800000000000  0.000106107536
-497000             75.706478185610 0.998327784087  3.800000000000  0.000116147819

You can download the background model.

Viewpoint computation

In this step the viewpoints for each reference point listed in a reference_points.bed-file is extracted from the interaction matrix, using the quality controlled file created by chicQualityControl. The up- and downstream range can be given via –range upstream downstream. Please use the same value for –fixateRange as in the background model computation. For each relative distance the x-fold over the average value of this relative distance is computed and each location is assigned a p-value based on the background negative binomial distribution for this relative distance. For each viewpoint one viewpoint file is created and stored in the folder given by the parameter –outputFolder.

chicViewpoint -m FL-E13-5.cool MB-E10-5.cool --averageContactBin 5 --range 1000000 1000000 -rp referencePoints.bed -bmf background_model.txt --writeFileNamesToFile interactionFiles.txt --outputFolder interactionFilesFolder --fixateRange 500000 --threads 20

The name of each viewpoint file starts with its sample name (given by the name of the matrix), the exact location and the gene / promoter name. For example, the viewpoint chr1 4487435 4487435 Sox17 from MB-E10-5.cool matrix will be called MB-E10-5_chr1_4487435_4487435_Sox17.txt and looks like the following:

# Interaction file, created with HiCExplorer's chicViewpoint version 3.2
# MB-E10-5.cool chr1_4487435_4487435    3.49  Mbp       5.49  Mbp       Sox17   Sum of interactions in fixate range: 978.0
# Chromosome    Start   End     Gene    Sum of interactions     Relative position       Relative Interactions   p-value x-fold  Raw
#
chr1    3487000 3488000 Sox17   978.0   -1000000        0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3488000 3489000 Sox17   978.0   -999000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3489000 3490000 Sox17   978.0   -998000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3490000 3491000 Sox17   978.0   -997000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3491000 3492000 Sox17   978.0   -996000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3492000 3493000 Sox17   978.0   -995000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3493000 3494000 Sox17   978.0   -994000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3494000 3495000 Sox17   978.0   -993000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3495000 3496000 Sox17   978.0   -992000 0.000000000000  0.894286365313  0.000000000000  0.000000000000
chr1    3496000 3497000 Sox17   978.0   -991000 0.000000000000  0.894286365313  0.000000000000  0.000000000000

Each file contains a header with information about the HiCExplorer version used, the sample, the viewpoint and the content of the different columns.

If the parameter –writeFileNamesToFile is set, the viewpoint file names are written to a file which can be used for batch processing in the later steps. Each sample is combined with every other sample for each viewpoint to enable the batch processing for the differential analysis. Example: matrices FL-E13-5.cool and MB-E10-5.cool with the three reference points:

FL-E13-5_chr1_4487435_4487435_Sox17.txt
MB-E10-5_chr1_4487435_4487435_Sox17.txt
FL-E13-5_chr1_14300280_14300280_Eya1.txt
MB-E10-5_chr1_14300280_14300280_Eya1.txt
FL-E13-5_chr1_19093103_19093103_Tfap2d.txt
MB-E10-5_chr1_19093103_19093103_Tfap2d.txt
Significant interactions detection

To detect significant interactions and to prepare a target file for each viewpoint which will be used for the differential analysis, the script chicSignificantInteractions is used. It offers two modes: either the user can specify an x-fold value or a loose p-value. The first one considers all interactions with a minimum x-fold over the average background for its relative distribution as a candidate or secondly, all interactions with a loose p-value or less are considered. These are pre-selection steps to be able to detect wider peaks in the same way as sharp ones. All detected candidates are merged to one peak if they are direct neighbors, and the sum of all interaction values of this neighborhood is used to compute a new p-value. The p-value is computed based on the relative distance negative binomial distribution of the interaction with the original highest interaction value. All peaks considered are accepted as significant interactions if their p-value is as small as the threshold –pvalue.

To exclude interactions with an interaction value smaller than desired the parameter –peakInteractionsThreshold can be set.

In this example we use the reference point Mstn at location chr1 53118507, a loose p-value of 0.1 and p-value of 0.01:

chicSignificantInteractions --interactionFile interactionFilesFolder/FL-E13-5_chr1_53118507_53118507_Mstn.txt -bmf background_model.txt --range 1000000 1000000 --pValue 0.01 --loosePValue 0.1

This creates two files:

FL-E13-5_chr1_53118507_53118507_Mstn_target.txt
FL-E13-5_chr1_53118507_53118507_Mstn__significant_interactions.txt

These files are stored in the folders given by the parameters –targetFolder and –outputFolder.

The significant interaction files looks like the following:

# FL-E13-5.cool     chr1_53118507_53118507  52.12  Mbp      54.12  Mbp      Mstn    Sum of interactions in fixate range: 1517.0
#Chromosome Start   End     Gene    Sum of interactions     Relative position       Relative interactions   p-value x-fold  Raw target
chr1        53318000        53321000        Mstn    1517.0  200000  0.00395517468600000040  0.00000145009991170397  8.37043994897500098773  6.00000000000000000000
chr1        53329000        53334000        Mstn    1517.0  212000  0.01081081081000000166  0.00000000000000188738  22.37661518795599846499 16.39999999999999857891
chr1        53348000        53350000        Mstn    1517.0  231000  0.00329597890600000004  0.00001463968364323609  7.37204640642899988734  5.00000000000000000000
chr1        53351000        53359000        Mstn    1517.0  239000  0.01437046802899999941  0.00000000000000099920  34.20972383882499912033 21.80000000000000071054
chr1        53393000        53401000        Mstn    1517.0  278000  0.01793012524599999977  0.00000000000000044409  48.20518935066399990319 27.19999999999999928946
chr1        53408000        53420000        Mstn    1517.0  294000  0.02307185234000000418  0.00000000000001743050  68.05162660125500906361 35.00000000000000000000

The target file looks like:

# Significant interactions result file of HiCExplorer's chicSignificantInteractions version 3.2-dev
# targetFolder/FL-E13-5_chr1_53118507_53118507_Mstn_target.txt
# Mode: loose p-value with 0.1
# Used p-value: 0.01
#
chr1        53318000        53321000
chr1        53329000        53334000
chr1        53348000        53359000
chr1        53393000        53401000
chr1        53408000        53420000
Batch mode

The batch mode supports the computation of many viewpoints at once and is able to create one target list for the same viewpoint and two (or n) samples. To do the batch computation the parameter –batchMode needs to be added and the folder of the viewpoint files needs to be defined. In addition, the list of viewpoints created by chicViewpoint with –writeFileNamesToFile needs to be used as input. One target file is created for n consecutive lines and can be defined via the –computeSampleNumber parameter. However, for the differential test where the target file is needed, only two samples and one target file is supported.

chicSignificantInteractions --interactionFile interactionFiles.txt --interactionFileFolder interactionFilesFolder/  -bmf background_model.txt --range 1000000 1000000 --pValue 0.01 --loosePValue 0.1 --batchMode

The output is:

  • A folder containing all target files, name defined by –targetFolder, default value is targetFolder

  • A folder with all significant interaction files, name defined by –outputFolder, default value is significantFiles

  • A list containing the file names of all target files, name defined by –targetFileList, default value is targetList.txt

  • A list containing the file names of all significant interaction files, name defined by –writeFileNamesToFile, default value is significantFilesBatch.txt

Aggregate data for differential test

The process to aggregate data is only necessary if the differential test is used. Either two files and one target list are used to generate the files for the differential test or the batch mode can be used. chicAggregateStatistic takes the created viewpoint files from chicViewpoint as input and either the target files per two samples created by chicSignificantInteractions or one target file which applies for all viewpoints.

chicAggregateStatistic --interactionFile interactionFilesFolder/FL-E13-5_chr1_53118507_53118507_Mstn.txt interactionFilesFolder/MB-E10-5_chr1_53118507_53118507_Mstn.txt --targetFile targetFolder/FL-E13-5_MB-E10-5_chr1_53118507_53118507_Mstn_target.txt

It selects the original data based on the target locations and returns one file per sample which is used for the differential test.

Batch mode

In the batch mode the interaction file is the file containing the viewpoint file names, the folder needs to be defined by –interactionFileFolder, the same applies to the target file and folder. Two viewpoint files are match with one target file created by chicSignificantInteractions or one target file for all viewpoints. Please notice the output files are written to the folder name defined by –outputFolder, the default is aggregatedFiles and it is recommended to write the file names for further batch processing with hicDifferentialTest to –writeFileNamesToFile. All output files get the suffix defined by –outFileNameSuffix, default value is _aggregate_target.txt.

chicAggregateStatistic --interactionFile interactionFiles.txt --interactionFileFolder interactionFilesFolder --targetFile targetList.txt --targetFileFolder targetFolder --batchMode
Differential test

The differential test tests the interaction value of the reference point and the interaction value of the target of two samples for a differential expression. To achieve this, either Fisher’s test or the chi-squared test can be used. H0 is defined as ‘both locations are equal’, meaning the differential expressed targets can be found in the H0 rejected file.

This can be computed per sample:

chicDifferentialTest --interactionFile aggregatedFiles/FL-E13-5_chr1_53118507_53118507_Mstn__aggregate_target.txt aggregatedFiles/MB-E10-5_chr1_53118507_53118507_Mstn__aggregate_target.txt --alpha 0.05 --statisticTest fisher

Or via batch mode:

chicDifferentialTest --interactionFile aggregatedFilesBatch.txt --interactionFileFolder aggregatedFiles --alpha 0.05 --statisticTest fisher --batchMode --threads 20

In both cases it is important to set the desired alpha value and the output is written to –outputFolder (default differentialResults). For each sample three files are created:

  • H0 rejected targets

  • H0 accepted targets

  • one file containing both

In the batch mode, the file –rejectedFileNamesToFile is also written and contains the file names of the rejected files. This can be used for the batch processing mode of chicPlotViewpoint.

# Differential analysis result file of HiCExplorer's chicDifferentialTest version 3.2-dev
# This file contains the p-values computed by fisher test
# To test the smoothed (float) values they were rounded up to the next integer
#
# Alpha level 0.05
# Degrees of freedom 1
#
# FL-E13-5.cool     chr1_53118507_53118507  52.12  Mbp      54.12  Mbp      Mstn    Sum of interactions in fixate range: 1517.0
# MB-E10-5.cool     chr1_53118507_53118507  52.12  Mbp      54.12  Mbp      Mstn    Sum of interactions in fixate range: 1670.0
#Chromosome Start   End     Gene    Relative distance       sum of interactions 1   target_1 raw    sum of interactions 2   target_2 raw    p-value
chr1        53089000        53091000        Mstn    -28000  1517.00000      5.00000 1670.00000      10.40000                0.21800
chr1        53131000        53133000        Mstn    14000   1517.00000      18.20000        1670.00000      23.60000                0.75900
chr1        53156000        53158000        Mstn    39000   1517.00000      3.00000 1670.00000      10.80000                0.06117
chr1        53251000        53254000        Mstn    135000  1517.00000      4.00000 1670.00000      9.60000         0.18614
chr1        53287000        53291000        Mstn    172000  1517.00000      7.20000 1670.00000      15.00000                0.29506
chr1        53305000        53309000        Mstn    190000  1517.00000      6.20000 1670.00000      12.40000                0.36952
chr1        53318000        53321000        Mstn    202000  1517.00000      6.00000 1670.00000      3.20000         0.53309
chr1        53326000        53334000        Mstn    215000  1517.00000      25.80000        1670.00000      22.60000                0.47374
chr1        53346000        53359000        Mstn    240000  1517.00000      31.60000        1670.00000      22.20000                0.13464
chr1        53408000        53421000        Mstn    302000  1517.00000      36.40000        1670.00000      28.20000                0.21290
# Differential analysis result file of HiCExplorer's chicDifferentialTest version 3.2-dev
# This file contains the p-values computed by fisher test
# To test the smoothed (float) values they were rounded up to the next integer
#
# Alpha level 0.05
# Degrees of freedom 1
#
# FL-E13-5.cool     chr1_53118507_53118507  52.12  Mbp      54.12  Mbp      Mstn    Sum of interactions in fixate range: 1517.0
# MB-E10-5.cool     chr1_53118507_53118507  52.12  Mbp      54.12  Mbp      Mstn    Sum of interactions in fixate range: 1670.0
#Chromosome Start   End     Gene    Relative distance       sum of interactions 1   target_1 raw    sum of interactions 2   target_2 raw    p-value
chr1        53393000        53401000        Mstn    282000  1517.00000      27.20000        1670.00000      6.40000         0.00012
# Differential analysis result file of HiCExplorer's chicDifferentialTest version 3.2-dev
# This file contains the p-values computed by fisher test
# To test the smoothed (float) values they were rounded up to the next integer
#
# Alpha level 0.05
# Degrees of freedom 1
#
# FL-E13-5.cool     chr1_53118507_53118507  52.12  Mbp      54.12  Mbp      Mstn    Sum of interactions in fixate range: 1517.0
# MB-E10-5.cool     chr1_53118507_53118507  52.12  Mbp      54.12  Mbp      Mstn    Sum of interactions in fixate range: 1670.0
#Chromosome Start   End     Gene    Relative distance       sum of interactions 1   target_1 raw    sum of interactions 2   target_2 raw    p-value
chr1        53089000        53091000        Mstn    -28000  1517.00000      5.00000 1670.00000      10.40000                0.21800
chr1        53131000        53133000        Mstn    14000   1517.00000      18.20000        1670.00000      23.60000                0.75900
chr1        53156000        53158000        Mstn    39000   1517.00000      3.00000 1670.00000      10.80000                0.06117
chr1        53251000        53254000        Mstn    135000  1517.00000      4.00000 1670.00000      9.60000         0.18614
chr1        53287000        53291000        Mstn    172000  1517.00000      7.20000 1670.00000      15.00000                0.29506
chr1        53305000        53309000        Mstn    190000  1517.00000      6.20000 1670.00000      12.40000                0.36952
chr1        53318000        53321000        Mstn    202000  1517.00000      6.00000 1670.00000      3.20000         0.53309
chr1        53326000        53334000        Mstn    215000  1517.00000      25.80000        1670.00000      22.60000                0.47374
chr1        53346000        53359000        Mstn    240000  1517.00000      31.60000        1670.00000      22.20000                0.13464
chr1        53393000        53401000        Mstn    282000  1517.00000      27.20000        1670.00000      6.40000         0.00012
chr1        53408000        53421000        Mstn    302000  1517.00000      36.40000        1670.00000      28.20000                0.21290
Plotting of Viewpoints

chicPlotViewpoint can plot n viewpoints in one plot, add the mean background, show the p-value per relative distance per sample as an additional heatmap bar and highlight significant interactions or differential expressed regions.

One viewpoint:

chicPlotViewpoint --interactionFile interactionFilesFolder/FL-E13-5_chr1_53118507_53118507_Mstn.txt --range 200000 200000 -o single_plot.png
_images/single_plot.png

Two viewpoints, background, differential expression and p-values:

chicPlotViewpoint --interactionFile interactionFilesFolder/FL-E13-5_chr1_53118507_53118507_Mstn.txt interactionFilesFolder/MB-E10-5_chr1_53118507_53118507_Mstn.txt --range 300000 300000 --pValue --differentialTestResult differentialResults/FL-E13-5_MB-E10-5_chr1_53118507_53118507_Mstn__H0_rejected.txt --backgroundModelFile background_model.txt -o differential_background_pvalue.png
_images/differential_background_pvalue.png

Two viewpoints, background, significant interactions and p-values:

chicPlotViewpoint --interactionFile interactionFilesFolder/FL-E13-5_chr1_53118507_53118507_Mstn.txt interactionFilesFolder/MB-E10-5_chr1_53118507_53118507_Mstn.txt --range 300000 300000 --pValue --significantInteractions significantFiles/FL-E13-5_chr1_53118507_53118507_Mstn__significant_interactions.txt significantFiles/MB-E10-5_chr1_53118507_53118507_Mstn__significant_interactions.txt --backgroundModelFile background_model.txt -o significant_background_pvalue.png
_images/significant_background_pvalue.png
Batch mode

The batch mode is able to plot files under the same parameter setting for multiple viewpoints. These viewpoints are given by the file created by chicViewpoint with –writeFileNamesToFile parameter. The number of consecutive lines which should be plotted to one image can be defined using –plotSampleNumber. If the differentially expressed regions should be highlighted, setting this parameter to 2 is recommended.

For all modes the principle of a file containing the file names and a folder containing them applies for the plotting too, and using all cores available is highly recommended.

chicPlotViewpoint --interactionFile interactionFiles.txt --interactionFileFolder interactionFilesFolder/ --range 300000 300000 --pValue --significantInteractions significantFilesBatch.txt --significantInteractionFileFolder significantFiles --backgroundModelFile background_model.txt --outputFolder plots --threads 20 --batchMode

How we use HiCExplorer

To generate a Hi-C contact matrix is necessary to perform the following basic steps

  1. Map the Hi-C reads to the reference genome

  2. Filter the aligned reads to create a contact matrix

  3. Filter matrix bins with low or zero read coverage

  4. Remove biases from the Hi-C contact matrices

After a corrected Hi-C matrix is created other tools can be used to visualize it, call TADS or compare it with other matrices.

Reads mapping

Mates have to be mapped individually to avoid mapper specific heuristics designed for standard paired-end libraries.

We have used the HiCExplorer successfully with bwa, bowtie2 and hisat2. However, it is important to:

  • for either bowtie2`or `hisat2 use the –reorder parameter which tells bowtie2 or hisat2 to output the sam files in the exact same order as in the .fastq files.

  • use local mapping, in contrast to end-to-end. A fraction of Hi-C reads are chimeric and will not map end-to-end thus, local mapping is important to increase the number of mapped reads.

  • Tune the aligner parameters to penalize deletions and insertions. This is important to avoid aligned reads with gaps if they happen to be chimeric.

# map the reads, each mate individually using
# for example bwa
#
# bwa mem mapping options:
#       -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
#       -B INT        penalty for a mismatch [4]
#       -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
#       -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1] # this is set very high to avoid gaps
#                                  at restriction sites. Setting the gap extension penalty high, produces better results as
#                                  the sequences left and right of a restriction site are mapped independently.
#       -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5] # this is set to no penalty.

$ bwa mem -A1 -B4  -E50 -L0  index_path \
    mate_R1.fastq.gz 2>>mate_R1.log | samtools view -Shb - > mate_R1.bam

$ bwa mem -A1 -B4  -E50 -L0  index_path \
    mate_R2.fastq.gz 2>>mate_R2.log | samtools view -Shb - > mate_R2.bam
Creation of a Hi-C matrix

Once the reads have been mapped the Hi-C matrix can be built. For this, the minimal extra information required is the binSize used for the matrix. Is it best to enter a low number like 10.000 because lower resolution matrices (larger bins) can be easily constructed using hicMergeMatrixBins. Matrices at restriction fragment resolution can be created by providing a file containing the restriction sites, this file can be created with the tool hicFindRestSites

hicFindRestSites that is part of HiCExplorer.

# build matrix from independently mated read pairs
# the restriction sequence GATC is recognized by the DpnII restriction enzyme

$ hicBuildMatrix --samFiles mate_R1.bam mate_R2.bam \
                 --binSize 10000 \
                 --restrictionSequence GATC \
                 --danglingSequence GATC \
                 --restrictionCutFile cut_sites.bed \
                 --threads 4 \
                 --inputBufferSize 100000 \
                 --outBam hic.bam \
                 -o hic_matrix.h5 \
                 --QCfolder ./hicQC

hicBuildMatrix creates two files, a bam file containing only the valid Hi-C read pairs and a matrix containing the Hi-C contacts at the given resolution. The bam file is useful to check the quality of the Hi-C library on the genome browser. A good Hi-C library should contain piles of reads near the restriction fragment sites. In the QCfolder a html file is saved with plots containing useful information for the quality control of the Hi-C sample like the number of valid pairs, duplicated pairs, self-ligations etc. Usually, only 25%-40% of the reads are valid and used to build the Hi-C matrix mostly because of the reads that are on repetitive regions that need to be discarded.

An important quality control measurement to check is the inter chromosomal fraction of reads as this is an indirect measure of random Hi-C contacts. Good Hi-C libraries have lower than 10% inter chromosomal contacts. The hicQC module can be used to compare the QC measures from different samples.

Correction of Hi-C matrix

The Hi-C matrix has to be corrected to remove GC, open chromatin biases and, most importantly, to normalize the number of restriction sites per bin. Because a fraction of bins from repetitive regions contain few contacts it is necessary to filter those regions first. Also, in mammalian genomes some regions enriched by reads should be discarded. To aid in the filtering of regions hicCorrectMatrix generates a diagnostic plot as follows:

$ hicCorrectMatrix diagnostic_plot -m hic_matrix.h5 -o hic_corrected.png

The plot should look like this:

_images/diagnostic_plot.png

Histogram of the number of counts per bin.

For the upper threshold is only important to remove very high outliers and thus a value of 5 could be used. For the lower threshold it is recommended to use a value between -2 and -1. What it not desired is to try to correct low count bins which could result simply in an amplification of noise. For the upper threshold is not so concerning because those bins will be scaled down.

Once the thresholds have been decided, the matrix can be corrected

# correct Hi-C matrix
$ hicCorrectMatrix correct -m hic_matrix.h5 --filterThreshold -1.5 5 -o hic_corrected.h5

In the case of multiple samples / replicates that need to be normalized to the same read coverage we recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

Visualization of results

There are two ways to see the resulting matrix, one using hicPlotMatrix and the other is using hicPlotTADs. The first one allows the visualization over large regions while the second one is preferred to see specific parts together with other information, for example genes or bigwig tracks.

Because of the large differences in counts found int he matrix, it is better to plot the counts using the –log1p option.

$ hicPlotMatrix -m hic_corrected.h5 -o hic_plot.png --region 1:20000000-80000000 --log1p
_images/corrected_matrix_example.png

Corrected Hi-C counts in log scale.

Quality control of Hi-C data and biological replicates comparison

HiCExplorer integrates multiple tools that allow the evaluation of the quality of Hi-C libraries and matrices.

  • hicQC on the log files produced by hicBuildMatrix and control of the pdf file produced.

Proportion of useful reads is important to assess the efficiency of the HiC protocol, which is dependant of proportion of dangling ends detected… Proportion of inter chromosomal, short range and long range contacts are important for….

  • hicPlotDistVsCounts to compare the distribution of corrected Hi-C counts in relation with the genomic

distance between multiple samples. If some differences are observed between biological replicates, these can be investigated more precisely by computing log2ratio matrices.

  • hicCompareMatrices log2ratio of matrices of biological replicates to identify where the potential changes are located.

  • hicPlotPCA bins correlation of two biological replicates.

TAD calling

To call TADs a corrected matrix is needed. Restriction fragment resolution matrices provide the best results. TAD calling works in two steps: First HiCExplorer computes a TAD-separation score based on a z-score matrix for all bins. Then those bins having a local minimum of the TAD-separation score are evaluated with respect to the surrounding bins to decide assign a p-value. Then a cutoff is applied to select the bins more likely to be TAD boundaries.

$ hicFindTADs -m hic_corrected.h5 --outPrefix hic_corrected --numberOfProcessors 16

This code will produce several files: 1. The TAD-separation score file, 2. the z-score matrix, 3. a bed file with the boundary location, 4. a bed file with the domains, 5. a bedgraph file with the TAD-score that can be visualized in a genome browser.

The TAD-separation score and the matrix can be visualized using hicPlotTADs.

_images/chorogenome_example.jpg

Example output from hicPlotTADs from http://chorogenome.ie-freiburg.mpg.de/

A / B compartment analysis

To compute the A / B compartments the matrix needs to be transformed to an observed/expected matrix in the way Lieberman-Aiden describes it. In a next step a pearson correlation matrix and based on it a covariance matrix is computed. Finally the eigenvectors based on the covariance matrix are computed. All these steps are computed with the command:

$ hicPCA -m hic_corrected.h5 --outFileName pca1.bw pca2.bw --format bigwig

If the intermediate matrices of this process should be used for plotting run:

$ hicTransform -m hic_corrected.h5 --outFileName all.h5 --method all

This creates all intermediate matrices: obs_exp_all.h5, pearson_all.h5 and covariance_all.h5.

The A / B compartments can be plotted with hicPlotMatrix.

$ hicPlotMatrix -m pearson_all.h5 --outFileName pca1.png --perChr --bigwig pca1.bw
_images/eigenvector1_lieberman.png

News and Developments

Release 3.5.3

14 October 2020

  • Bug fix release:
    • Reads from scaffolds without any restriction enzym cut site are considered as ‘same fragment’. An appearance of such a read will not lead to a crash anymore

  • Minor documentation improvements

Release 3.5.2

06 October 2020

  • Bug fix release:
    • enforcing version 15 of HiCMatrix. Version 14 had a bug concerning the application of the correction factors of cool files. See issue #595

    • Fixing a bug in hicDetectLoops in single-core mode. Thanks @alecw

    • Fixing a bug in hicDifferentialTAD concerning multiple chromosomes in the bed file. See issue #587

    • Updating dependencies to newest versions, except biopython. Forcing here <1.77 because their API change is breaking our source code. See issue #609

    • Fixing #596

  • Changing internal structure of the docs. Navigation should be better now.

Release 3.5.1

11 August 2020

  • patch hicCorrelate

  • hicBuildMatrix: Better help text

  • patch for hicPlotMatrix if matrix does not start at 0 and –region is used

  • bug fix for remove option in hicAdjustMatrix

Release 3.5

10 July 2020

  • Major update for hicDetectLoops: Results are now closer to HiCCUPS, it is faster and needs less memory.

  • hicHyperoptDetectLoops: New tool to compute the best parameter setting if the number of expected loops is given and a protein boundary file (e.g. CTCF on mammals) is provided

  • hicHyperoptDetectLoopsHiCCUPS: New tool to compute the best parameter setting for Juicers HiCCUPS. HiCCUPS and all its dependencies are not part of HiCExplorer and must be provided by the user. Number of expected loops and a protein boundary file (e.g. CTCF on mammals) must be provided.

  • hicMergeDomains: New tool to compute a merge of TADs computed on different resolutions. Moreover it provides a cleaning of the boundaries with the help of a protein peak file, and the hierarchical dependencies of TADs can be plotted. This tool is the result of the Bachelor thesis from Sarah Domogalla (@SarahMaria27). Thanks for your contribution!

  • hicDifferentialTAD: New tool to compute differential TADs between two samples

  • Bug fix for hicFindTADs: The format of the intermediate z-score matrices are now depending on the format of the input Hi-C matrix

  • Bug fix for chic*-modules: Fixate range is now correct applied.

  • chicSignificantInteractions, hicDetectLoops: Option to use a per relative position p-value with a p-value threshold file

  • Adding hicCreateThresholdFile: A script to generate a per position p-value threshold for hicDetectLoops and chicSignificantInteractions

  • Bug fix for hicPlotMatrix:
    • multiple bigwig tracks in the vertical setting are now supported

    • correct plot of bigwig if the given matrix does not start at the beginning of the chromosome

    • new parameters ‘increaseFigureWidth’ and ‘increaseFigureHeight’ to add more space to the plot if multiple bigwigs are plotted. Adjust this parameter to correct the plot of the heatmap which may be not quadratic anymore.

    • restriction of the loop regions to the user given range. This effects especially SVGs that will now contain less objects as before.

  • New feature for hicBuildMatrix:
    • multiple restriction cut sequences, dangling ends and restriction cut sites files are now supported

    • restriction cut sequences, dangling ends and restriction cut sites files are now mandatory parameters. This is now enforced to guarantee a correct detection of self ligations and self circles

  • hicPrepareQCreport: New support for multiple dangling ends

  • hicQuickQC: restriction cut sequences, dangling ends and restriction cut sites files are now mandatory parameters

  • hicFindRestSite: gz support for fasta file

  • Add fallback modes to multiple scripts if the parallelization fails.

  • hicAggregate: interactions between two bed files by comparing every row in the first bed file with its corresponding row in the second file. (issue #390)

  • hicAdjustMatrix: fix #341, 454
    • fixed –action remove: it actually remove the part from the matrix previously was masking it

    • the case where the end of the chromosome need to be removed.

  • New Azure testing for the general test cases, the trivial ones run on travis

Publication

02 July 2020

Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization Joachim Wolff, Leily Rabbani, Ralf Gilsbach, Gautier Richard, Thomas Manke, Rolf Backofen, Björn A Grüning Nucleic Acids Research, Volume 48, Issue W1, 02 July 2020, Pages W177–W184

Release 3.4.3

23 March 2020

  • Fixing the wrong p-value computation in for chicViewpoint. New method is more accurate for floating points.

  • Fixing a bug in chicViewpointBackgroundModel and chicQualityControl if an non-existing reference point was used.

  • Improving all chic* modules with a capturing of errors in sub-processes. It is now guaranteed that the computation will terminate. Either successful or by error.

  • Add option ‘truncateZero’ to chicViewpointBackgroundModel: This removes all zero values for the distributions before fitting to fight over dispersion.

  • Add option ‘decimalPlaces’ to chicViewpoint to adjust the decimal places in the output for all floating values. Helpful for really small p-values

  • Add option ‘truncateZeroPvalues’ to chicSignificantInteractions to set all p-values which are 0 to 1 and are therefore ignored.

  • Add option ‘truncateZeroPvalues’ to chicPlotViewpoint to set all p-values which are 0 to 1 and do not disturb the presentation of real p-values

Release 3.4.2

7 March 2020

  • This release fixes the wrong name scheme which was used in the chicModules. The most .bed files are now .txt files.

  • hicDetectLoops got an inner chromosome parallelization to decrease the compute time.

  • hicPlotMatrix got three new parameters: rotationX, rotationY and fontSize to adjust the position and font size of the labels. We hope this can lead in certain cases to a a better readability

  • hicPlotMatrix: fixed a bug that occurred if the list of chromosomes was given and the last chromosome appeared as an additional label.

  • Improving and updating the documentation.

Preprint

**6 March 2020*

The preprint of the loop detection algorithm is online via biorXiv: https://www.biorxiv.org/content/10.1101/2020.03.05.979096v1

Release 3.4.1

3 February 2020

  • This release fixes a bug in chicViewpoint that caused a crash if the data to be averaged is less than the window size.

Release 3.4

28 January 2020

  • Fixing a bug in hicAdjustMatrix: keep option had a bug concerning the cutting before the end of a chromosome or the start position was not from the beginning of the chromosome

  • hicCompartmentPolarization was renamed to hicCompartmentalization and got some bug fixes

  • Extending the option on how the observed vs. Expected matrix is computed and adding the parameter –ligation_factor to achieve a rescale behaviour of the values as it is implemented in Homer. The same changes are applied to hicTransform

  • Improved the documentation

  • Adding an option in hicAverageRegions to select start, end, center or start_end as start index for up/downstream range.

  • hicBuildMatrix: Removed default value of binSize to enable mutually exclusive group error if not one of them is set. Behaviour so far was that the binSize was taken.

  • hicPlotSVL: adding xlegend to plot of SVL ratios to indicate the data points per boxplots are the chromosome ratios

  • hicQuickQC: Removed binSize option of hicQuickQC because it does not matter for QC calculation and adding a sentence to recommend the usage of restriction enzyme and dangling end sequence. Fixing bug issue #464

  • hicNormalize: Adding option in hicNormalize to remove values after the normalization if values are smaller than a given threshold

  • Capture Hi-C modules: Change background model distribution assumption from negative binomial to continuous negative binomial by using Gamma functions as a replacement for the binomial coefficient. Source: https://stats.stackexchange.com/questions/310676/continuous-generalization-of-the-negative-binomial-distribution/311927

  • hicInfo: Implementing feature request #456. The length of chromosomes is now show in the information too

Release 3.3.1

15 November 2019

  • Fixing a bug in the labeling of chicPlotViewpoints if the value range is counted in MB

  • Add an option to chicViewpoint to pre-compute a x-fold of p-value over the maximum value of the relative distance

Release 3.3

8 October 2019

  • Fixing many bugs:
    • A bug in hicDetectLoops if a sub-matrix was very small

    • A bug in hicPlotMatrix if the region defined by –region was only a chromosome and loops should be plotted too

    • A bug in hicPlotMatrix if a loop region should be plotted and chromosomeOrder argument was used too

    • A bug in hicAggregateContacts (issue #405) if chromosomes were present in the matrix but not in the bed file

    • A bug in hicAdjustMatrix concerning a bed file and consecutive regions, see issue #343

    • A bug in hicAdjustMatrix if a chromosome is present in the matrix but not in the bed file, see issue #397

    • A bug in hicCompartmentsPolarization concerning the arguments ‘quantile’ and ‘outliers’ were interpreted as strings but should be integers

    • A bug in hicAdjustMatrix concerning the ‘keep’ option and how matrices are reordered internally. Thanks @LeilyR

  • Added features as requested:
    • hicPCA ignores now masked bins, see issue #342

    • chicPlotViewpoint:
      • Better legend handling on x-axis

      • Peaks are now display with their fill width

      • Add option –pValueSignificantLevels to categorize the p-values in x levels (e.g. 0.001 0.05 0.1)

    • chicViewpoint:
      • adding sorting via viewpoints and not by samples option (–allViewpointsList)

    • Adding an option to hicNormalize to normalize via multiplication and a use defined value (see issues #385, #424)

  • Rearrange hicAdjustMatrix to have a better accessibility to its functions from outside of main

  • Improving the documentation and fixing grammar / spelling mistakes. Thanks @simonbray

  • New script: hicPlotSVL to investigate short range vs long range ratios.

Release 3.2

** 22 August 2019**

  • Adding the new captured Hi-C module. Viewpoint analysis based on a background model, significant interaction detection and differential analysis are provided.

  • Adding documentation for captured Hi-C module and a tutorial on how to use it.

  • Adding a module to be able to detect quite fast the quality of a Hi-C data set: hicQuickQC.

  • Adding a tool to merge loops of different resolutions.

  • Improving validation of locations: Presorting is no longer necessary; adding feature to add ‘chr’ prefix to loop or protein chromosome name

  • Change loop detection slightly to improve results and fixed bugs:
    • preselection p-value was ignored and only p-value was used

    • adding additional test to the peak region test to decrease false discoveries

    • exchanging pThreshold / ln(distance) to remove too low values by a share of the maximum value of the distance. New parameter ‘maximumInteractionPercentageThreshold’

  • Removal of the folder ‘scripts’ and its content. These were outdated scripts and will maybe part of regular Hi-C tools in the future.

Release 3.1

9 July 2019

  • KR correction improvements: It is now able to process larger data sets like GM12878 primary+replicate on 10kb resolution.

  • Adding script for validation of loop locations with protein peak locations

  • Adding script hicCompartmentsPolarization: Rearrange the average interaction frequencies using the first PC values to represent the global compartmentalization signal

Release 3.0.2

28 June 2019

  • Pinning dependencies to:

    • hicmatrix version 9: API changes in version 10

    • krbalancing version 0.0.4: API changes in version 0.0.5

    • matplotlib version 3.0: Version 3.1 raises ‘Not implemented error’ for unknown reasons.

  • Set fit_nbinom to version 1.1: Version 1.0 Had deprecated function call of scipy > 1.2.

  • Small documentation fixes and improvements.

Release 3.0.1

5 April 2019

  • Fixes KR balancing correction factors

  • Deactivates log.debug

Release 3.0

3 April 2019

  • Python 3 only. Python 2.X is no longer supported

  • Additional Hi-C interaction matrix correction algorithm ‘Knight-Ruiz’ as a C++ module for a faster runtime and less memory usage.

  • Enriched regions detection tool: ‘hicDetectLoops’ based on strict candidate selection, ‘hicFindEnrichedContacts’ was deleted

  • Metadata for cooler files is supported: hicBuildMatrix and hicInfo are using it

  • New options for hicPlotMatrix: –loops to visualize computed loops from hicDetectLoops and –bigwigAdditionalVerticalAxis to display a bigwig track on the vertical axis too.

Release 2.2.3

22 March 2019

  • This bug fix release patches an issue with cooler files, hicBuildMatrix and the usage of a restriction sequence file instead of fixed bin size.

Release 2.2.2

27 February 2019

  • This bug fix release removes reference to hicExport that were forgotten to delete in 2.2. Thanks @BioGeek for this contribution.

Release 2.2.1

7 February 2019

  • Muting log output of matplotlib and cooler

  • Set version number of hicmatrix to 7

  • Optional parameter for hicInfo to write the result to a file instead to the bash

Release 2.2

18 January 2019

This release contains:

  • replaced hicExport by hicConvertFormat and hicAdjustMatrix

  • extended functionality for hicConvertFormat

    • read support for homer, hicpro, cool, h5

    • write support for h5, homer, cool

    • convert hic to cool

    • creation of mcool matrices

  • hicAdjustMatrix

    • remove, keep or mask specified regions from a file, or chromosomes

  • hicNormalize

    • normalize matrices to 0 - 1 range or to the read coverage of the lowest given

  • hicBuildMatrix

    • support for build mcool

  • restructuring the central class HiCMatrix to object oriented model and moved to its own library: deeptools/HiCMatrix.

    • Extended read / write support for file formats

    • better (faster, less memory) support for cool format

    • remove of old, unused code

    • restrict support to h5 and cool matrices, except hicConvertFormat

  • hicFindTADs: Option to run computation per specified chromosomes

  • hicPlotTADs: removed code and calls pyGenomeTracks

  • hicAverageRegions: Sum up in a given range around defined reference points. Useful to detect changes in TAD structures between different samples.

  • hicPlotAverageRegions: Plots such a average region

  • hicTransform: Restructuring the source code, remove of option ‘all’ because it was generating confusion. Adding option ‘exp_obs’, exp_obs_norm and exp_obs_lieberman. These three different options use different expectation matrix computations.

  • hicPCA

    • Adding –norm option to compute the expected matrix in the way HOMER is doing it. Useful for drosophila genomes

    • Adding option to write out the intermediate matrices ‘obs_exp’ and ‘pearson’ which are necessary in the computation of the PCA

  • hicPlotMatrix

    • Add option to clip bigwig values

    • Add option to scale bigwig values

  • Removed hicLog2Ration, functionality is covered by hicCompareMatrices

  • Extending test cases to cover more source code and be hopefully more stable.

  • Many small bugfixes

Publication

13 June 2018

We are proud to announce our latest publication:

Joachim Wolff, Vivek Bhardwaj, Stephan Nothjunge, Gautier Richard, Gina Renschler, Ralf Gilsbach, Thomas Manke, Rolf Backofen, Fidel Ramírez, Björn A Grüning. “Galaxy HiCExplorer: a web server for reproducible Hi-C data analysis, quality control and visualization”, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W11–W16, doi: https://doi.org/10.1093/nar/gky504

Release 2.1.4

25 May 2018

  • cooler file format correction factors are applied as they should be

  • parameter ‘–region’ of hicBuildMatrix works with Python 3

Release 2.1.3

7 May 2018

The third bugfix release of version 2.1 corrects an error in hicPlotViewpoint. It adds a feature requested in issue #169 which should have been included in release 2.1 but was accidentally not.

From 2.1 release note: hicPlotViewpoint: Adds a feature to plot multiple matrices in one image

Release 2.1.2

26 April 2018

The second bug fix release of 2.1 includes:

  • documentation improvements

  • fixing broken Readthedocs documentation

  • Small bug fix concerning hicPlotMatrix and cooler: –chromosomeOrder is now possible with more than one chromosome

  • Small fixes concerning updated dependencies: Fixing version number a bit more specific and not that strict in test cases delta values.

Release 2.1.1

27 March 2018

This release fixes a problem related to python3 in which chromosome names were of bytes type

Release 2.1

5 March 2018

The 2.1 version of HiCExplorer comes with new features and bugfixes.

  • Adding the new feature hicAggregateContacts: A tool that allows plotting of aggregated Hi-C sub-matrices of a specified list of positions.

  • Many improvements to the documentation and the help text. Thanks to Gina Renschler and Gautier Richard from the MPI-IE Freiburg, Germany.

  • hicPlotMatrix

    • supports only bigwig files for an additional data track.

    • the argument –pca was renamed to –bigwig

    • Smoothing the bigwig values to neighboring bins if no data is present there

    • Fixes to a bug concerning a crash of tight_layout

    • Adding the possibility to flip the sign of the values of the bigwig track

    • Adding the possibility to scale the values of the bigwig track

  • hicPlotViewpoint: Adds a feature to plot multiple matrices in one image

  • cooler file format

    • supports mcool files

    • applies correction factors if present

    • optionally reads bin[‘weight’]

  • fixes

    • a crash in hicPlotTads if horizontal lines were used

    • checks if all characters of a title are ASCII. If not they are converted to the closest looking one.

  • Updated and fixate version number of the dependencies

Release 2.0

December 21, 2017

This release makes HiCExplorer ready for the future:

  • Python 3 support

  • Cooler file format support

  • A/B comparment analysis

  • Improved visualizations

  • bug fixes for --perChr option in hicPlotMatrix

  • eigenvector track with --pca for hicPlotMatrix

  • visualization of interactions around a reference point or region with hicPlotViewpoint

  • Higher test coverage

  • re-licensing from GPLv2 to GPLv3

Release 1.8.1

November 27, 2017

Bug fix release:

  • a fix concerning the handling chimeric alignments in hicBuildMatrix. Thanks to Aleksander Jankowski @ajank

  • handling of dangling ends was too strict

  • improved help message in hicBuildMatrix

Release 1.8

October 25, 2017

This release is adding new features and fixes many bugs:

  • hicBuildMatrix: Added multicore support, new parameters –threads and –inputBufferSize

  • hicFindTADs:

  • One call instead of two: hicFindTADs TAD_score and hicFindTADs find_TADs merged to hicFindTADs.

  • New multiple correction method supported: False discovery rate. Call it with –correctForMultipleTesting fdr and –threshold 0.05.

  • Update of the tutorial: mES-HiC analysis.

  • Additional test cases and docstrings to improve the software quality

  • Fixed a bug occurring with bigwig files with frequent NaN values which resulted in only NaN averages

  • hicPlotTADs: Support for plotting points

  • Moved galaxy wrappers to https://github.com/galaxyproject/tools-iuc

  • Fixed multiple bugs with saving matrices

  • hicCorrelate: Changes direction of dendograms to left

Release 1.7.2

April 3, 2017

  • Added option to plot bigwig files as a line hicPlotTADs

  • Updated documentation

  • Improved hicPlotMatrix –region output

  • Added compressed matrices. In our tests the compressed matrices are significantly smaller.

March 28, 2017

Release 1.7

March 28, 2017

This release adds a quality control module to check the results from hicBuildMatrix. By default, now hicBuildMatrix generates a HTML page containing the plots from the QC measures. The results from several runs of hicBuildMatrix can be combined in one page using the new tool hicQC.

Also, this release added a module called hicCompareMatrices that takes two Hi-C matrices and computes the difference, the ratio or the log2 ratio. The resulting matrix can be plotted with hicPlotMatrix to visualize the changes.

Preprint introducing HiCExplorer is now online

March 8, 2017

Our #biorXiv preprint on DNA sequences behind Fly genome architecture is online!

Read the article here : http://biorxiv.org/content/early/2017/03/08/115063

In this article, we introduce HiCExplorer : Our easy to use tool for Hi-C data analysis, also available in Galaxy.

We also introduce HiCBrowser : A standalone software to visualize Hi-C along with other genomic datasets.

Based on HiCExplorer and HiCBrowser, we built a useful resource for anyone to browse and download the chromosome conformation datasets in Human, Mouse and Flies. It’s called the chorogenome navigator

Along with these resources, we present an analysis of DNA sequences behind 3D genome of Flies. Using high-resolution Hi-C analysis, we find a set of DNA motifs that characterize TAD boundaries in Flies and show the importance of these motifs in genome organization.

We hope that these resources and analysis would be useful for the community and welcome any feedback.

HiCExplorer wins best poster prize at VizBi2016

March 20, 2016

We are excited to announce that HiCExplorer has won the NVIDIA Award for Best Scientific Poster in VizBi2016, the international conference on visualization of biological data.

Read more here

This was our poster :

HiCExplorer

Citation

Please cite HiCExplorer as follows:

Joachim Wolff, Leily Rabbani, Ralf Gilsbach, Gautier Richard, Thomas Manke, Rolf Backofen, Björn A Grüning. Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization, Nucleic Acids Research, Nucleic Acids Research, Volume 48, Issue W1, 02 July 2020, Pages W177–W184, https://doi.org/10.1093/nar/gkaa220

Joachim Wolff, Vivek Bhardwaj, Stephan Nothjunge, Gautier Richard, Gina Renschler, Ralf Gilsbach, Thomas Manke, Rolf Backofen, Fidel Ramírez, Björn A Grüning. Galaxy HiCExplorer: a web server for reproducible Hi-C data analysis, quality control and visualization, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W11–W16, doi: https://doi.org/10.1093/nar/gky504

Fidel Ramirez, Vivek Bhardwaj, Jose Villaveces, Laura Arrigoni, Bjoern A Gruening,Kin Chung Lam, Bianca Habermann, Asifa Akhtar, Thomas Manke. “High-resolution TADs reveal DNA sequences underlying genome organization in flies”. Nature Communications, Volume 9, Article number: 189 (2018), doi: https://doi.org/10.1038/s41467-017-02525-w

_images/logo_mpi-ie.jpg

This tool suite is developed by the Bioinformatics Unit at the Max Planck Institute for Immunobiology and Epigenetics, Freiburg and by the Bioinformatics Lab of the Albert-Ludwigs-University Freiburg, Germany.