hicCorrectMatrix

This tool provides two balancing methods that can be applied to a raw matrix.

I. KR: It balances a matrix using a fast balancing algorithm introduced by Knight and Ruiz (2012).

II. ICE: Iterative correction of a Hi-C matrix (see Imakaev et al. 2012, Nature Methods, for details). For this method to work correctly, bins with zero reads assigned to them should be removed, as they cannot be corrected. Bins with a low number of reads should also be removed; otherwise, the counts associated with those bins will be amplified during the correction step (zero- and low-coverage bins usually contain repetitive regions). Bins with an extremely high number of reads can also be removed from the correction, as they may represent copy number variations.
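To make the idea behind iterative correction concrete, the following is a minimal NumPy sketch of the scheme described by Imakaev et al.: each bin is repeatedly divided by its relative coverage until all bins have the same total. It is an illustration under simplifying assumptions (dense symmetric matrix, zero-coverage bins already removed), not the implementation used by hicCorrectMatrix; the function name ice_balance and its parameters are hypothetical.

```python
import numpy as np

def ice_balance(matrix, n_iter=500, tol=1e-5):
    """Simplified iterative correction of a dense, symmetric contact matrix.

    Assumes bins with zero coverage were removed beforehand.
    Returns the balanced matrix and the accumulated per-bin bias.
    """
    m = np.asarray(matrix, dtype=float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        coverage = m.sum(axis=1)
        # Relative coverage of each bin; the mean factor is 1.
        factor = coverage / coverage.mean()
        # Divide every entry by the factors of its row bin and column bin.
        m /= np.outer(factor, factor)
        bias *= factor
        # Stop once the factors no longer change the matrix noticeably.
        if np.abs(factor - 1.0).max() < tol:
            break
    return m, bias

# Toy 3x3 matrix (made-up counts, for illustration only).
raw = np.array([[10.0, 4.0, 2.0],
                [4.0, 20.0, 6.0],
                [2.0, 6.0, 8.0]])
balanced, bias = ice_balance(raw)
print(balanced.sum(axis=1))  # row sums are (nearly) identical after balancing
```

The sketch also shows why low-coverage bins are a problem: a bin with a very small coverage gets a very small factor, so its few counts are divided by a small number and thereby amplified.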

To aid in the identification of bins with low and high read coverage, the histogram of the number of reads can be plotted together with the Median Absolute Deviation (MAD).

It is recommended to run hicCorrectMatrix as follows:

$ hicCorrectMatrix diagnostic_plot --matrix hic_matrix.h5 -o plot_file.png

Then, after reviewing the plot and deciding on the threshold values:

$ hicCorrectMatrix correct --correctionMethod ICE --matrix hic_matrix.h5 --filterThreshold <lower threshold> <upper threshold> -o corrected_matrix

For a more in-depth review of how to determine the threshold values, please visit: http://hicexplorer.readthedocs.io/en/latest/content/example_usage.html#correction-of-hi-c-matrix

We recommend normalizing the data first (with hicNormalize) and correcting it (with hicCorrectMatrix) in a second step.

usage: hicCorrectMatrix [-h] [--version]  ...

Named Arguments

--version

show program’s version number and exit

Options

Possible choices: diagnostic_plot, correct

To get detailed help on each of the options:

$ hicCorrectMatrix diagnostic_plot -h

$ hicCorrectMatrix correct -h

Sub-commands:

diagnostic_plot

Plots a histogram of the coverage per bin together with the modified z-score based on the median absolute deviation method (see Boris Iglewicz and David Hoaglin 1993, Volume 16: How to Detect and Handle Outliers, The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor).

$ hicCorrectMatrix diagnostic_plot --matrix hic_matrix.h5 -o file.png
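For readers unfamiliar with the modified z-score, the sketch below computes it in the textbook form given by Iglewicz and Hoaglin, 0.6745 * (x - median) / MAD, and shows how a lower and upper threshold would select bins. The coverage values are made up, the helper name modified_z_score is hypothetical, and the exact preprocessing applied by hicCorrectMatrix before computing the score may differ.

```python
import numpy as np

def modified_z_score(coverage):
    """Modified z-score based on the median absolute deviation (MAD)."""
    cov = np.asarray(coverage, dtype=float)
    median = np.median(cov)
    mad = np.median(np.abs(cov - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return 0.6745 * (cov - median) / mad

# Made-up per-bin coverage: two nearly empty bins and one very high bin.
coverage = np.array([0, 5, 95, 100, 102, 98, 105, 110, 600])
z = modified_z_score(coverage)

# Example threshold pair in the style of --filterThreshold -1.5 5
lower, upper = -1.5, 5.0
keep = (z >= lower) & (z <= upper)
print(np.round(z, 2))
print(keep)
```

Because the score is built from the median rather than the mean, a few extreme bins do not shift the scale, which is what makes it suitable for picking coverage thresholds from the diagnostic plot.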

Required arguments

--matrix, -m

Name of the Hi-C matrix to correct in .h5 format.

--plotName, -o

File name to save the diagnostic plot.

Optional arguments

--chromosomes

List of chromosomes to be included in the iterative correction. The order of the given chromosomes will then be kept for the resulting corrected matrix.

--xMax

Max value for the x-axis in counts per bin.

--perchr

Compute histogram per chromosome. For samples from cells with uneven number of chromosomes and/or translocations it is advisable to check the histograms per chromosome to find the most conservative filterThreshold.

Default: False

--verbose

Print processing status.

Default: False

correct

Run the Knight-Ruiz matrix balancing algorithm (KR) or the iterative matrix correction (ICE).

$ hicCorrectMatrix correct --matrix hic_matrix.h5 --filterThreshold -1.2 5 -o corrected_matrix.h5

(--filterThreshold applies only if ICE is used.)

Required arguments

--matrix, -m

Name of the Hi-C matrix to correct in .h5 format.

--outFileName, -o

File name to save the resulting matrix. The output is a .h5 file.

Optional arguments

--correctionMethod

Possible choices: KR, ICE

Method to be used for matrix correction. It can be set to KR or ICE.

Default: “KR”

--filterThreshold, -t

Removes bins of low or high coverage. Usually these bins do not contain valid Hi-C data or represent regions that accumulate reads, and thus must be discarded. Use hicCorrectMatrix diagnostic_plot to identify the modified z-value thresholds. A lower and an upper threshold are required, separated by a space, e.g. --filterThreshold -1.5 5. Applied only for ICE!

--iterNum, -n

Number of iterations to compute. Only for ICE!

Default: 500

--inflationCutoff

Value corresponding to the maximum number of times a bin can be scaled up during the iterative correction. For example, an inflation cutoff of 3 will filter out all bins that were expanded 3 times or more during the iterative correction. Only for ICE!
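One way to picture the inflation cutoff is as a comparison of each bin's coverage before and after correction. The sketch below assumes that "expanded N times" means the ratio of corrected to raw row sums; this reading, the helper name inflated_bins, and the toy matrices are assumptions for illustration and are not taken from the hicCorrectMatrix source.

```python
import numpy as np

def inflated_bins(raw, corrected, inflation_cutoff):
    """Return indices of bins whose coverage grew by at least the cutoff.

    Assumption: inflation is measured as corrected row sum / raw row sum.
    """
    pre = np.asarray(raw, dtype=float).sum(axis=1)
    post = np.asarray(corrected, dtype=float).sum(axis=1)
    # Bins with zero raw coverage get a ratio of 0 and are never reported.
    ratio = np.divide(post, pre, out=np.zeros_like(post), where=pre > 0)
    return np.flatnonzero(ratio >= inflation_cutoff)

# Toy values, not the output of a real correction run.
raw = np.array([[90.0, 5.0, 5.0],
                [5.0, 2.0, 3.0],
                [5.0, 3.0, 42.0]])
corrected = np.array([[40.0, 5.0, 5.0],
                      [5.0, 30.0, 15.0],
                      [5.0, 15.0, 30.0]])
to_remove = inflated_bins(raw, corrected, inflation_cutoff=3)
print(to_remove)
```

In this toy case only the low-coverage middle bin exceeds the cutoff, which matches the intent of the option: bins whose counts were inflated strongly by the correction are treated as unreliable.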

--transCutoff, -transcut

Clip high counts in the top -transcut trans regions (i.e. between chromosomes). A usual value is 0.05. Only for ICE!

--sequencedCountCutoff

Each bin receives a value indicating the fraction that is covered by reads. A cutoff of 0.5 will discard all those bins that have less than half of the bin covered. Only for ICE!

--chromosomes

List of chromosomes to be included in the iterative correction. The order of the given chromosomes will then be kept for the resulting corrected matrix.

--skipDiagonal, -s

If set, diagonal counts are not included. Only for ICE!

Default: False

--perchr

Normalize each chromosome separately. This is useful for samples from cells with uneven number of chromosomes and/or translocations.

Default: False

--verbose

Print processing status.

Default: False

--version

show program’s version number and exit

With HiCExplorer version 3.0 we offer an additional Hi-C interaction matrix correction algorithm: Knight-Ruiz.

$ hicCorrectMatrix correct --matrix matrix.cool --correctionMethod KR --chromosomes chrUextra chr3LHet --outFileName corrected_KR.cool

The iterative correction can be used via:

$ hicCorrectMatrix correct --matrix matrix.cool --correctionMethod ICE --chromosomes chrUextra chr3LHet --iterNum 500  --outFileName corrected_ICE.cool --filterThreshold -1.5 5.0

HiCExplorer version 3.1 changes the way data is transferred from Python to C++ for the KR correction algorithm. With these changes, the following peak memory usage and runtimes were measured on the Rao 2014 GM12878 primary + replicate data:

  • KR on 25kb: 165 GB, 1:08 h

  • ICE on 25kb: 224 GB, 3:10 h

  • KR on 10kb: 228 GB, 1:42 h

  • ICE on 10kb: 323 GB, 4:51 h

  • KR on 1kb: 454 GB, 16:50 h

  • ICE on 1kb: >600 GB, > 2.5 d (we interrupted the computation and strongly recommend using KR at this resolution)

For HiCExplorer versions <= 3.0 KR performs as follows:

  • KR on 25kb: 159 GB, 57:11 min

  • KR on 10kb: >980 GB, – (out of memory on a 1 TB node; we do not have access to a node with more memory on our cluster)