HiCExplorer

Set of programs to process, normalize, analyze and visualize Hi-C and cHi-C data

HiCExplorer addresses the common tasks of Hi-C data analysis from processing to visualization.

_images/hicex3.png

Availability

HiCExplorer is available as a command line suite of tools on this GitHub repository.

A Galaxy HiCExplorer version is directly available to users at http://hicexplorer.usegalaxy.eu. Training material is available at the Galaxy Training Network, while a Galaxy Tour is available here for users not familiar with this platform. Galaxy HiCExplorer is also available as a Docker image at the Docker Galaxy HiCExplorer GitHub repository. Finally, this Galaxy version is available on the Galaxy Tool Shed and on the corresponding GitHub repository.

The following is the list of tools available in HiCExplorer

tool

description

hicFindRestSites

Identifies the genomic locations of restriction sites

hicBuildMatrix

Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads

hicQuickQC

Estimates the quality of Hi-C dataset

hicQC

Plots QC measures from the output of hicBuildMatrix

hicCorrectMatrix

Uses iterative correction to remove biases from a Hi-C matrix

hicDetectLoops

Identifies enriched Hi-C contacts

hicCorrelate

Computes and visualizes the correlation of Hi-C matrices

hicFindTADs

Identifies Topologically Associating Domains (TADs)

hicMergeDomains

Merges TADs of different resolutions

hicDifferentialTAD

Computes differential TADs between two Hi-C samples

hicPCA

Computes for A / B compartments the eigenvectors

hicTransform

Computes a obs_exp matrix like Lieberman-Aiden (2009), a pearson correlation matrix and or a covariance matrix. These matrices can be used for plotting.

hicMergeMatrixBins

Merges consecutive bins on a Hi-C matrix to reduce resolution

hicMergeTADbins

Uses a BED file of domains or TAD boundaries to merge the bin counts of a Hi-C matrix.

hicPlotDistVsCounts

Plot the decay in interaction frequency with distance

hicPlotMatrix

Plots a Hi-C matrix as a heatmap

hicPlotTADs

Plots TADs as a track that can be combined with other tracks (genes, signal, interactions)

hicPlotViewpoint

A plot with the interactions around a reference point or region.

hicAggregateContacts

A tool that allows plotting of aggregated Hi-C sub-matrices of a specified list of positions.

hicSumMatrices

Adds Hi-C matrices of the same size

hicPlotDistVsCounts

Plots distance vs. Hi-C counts of corrected data

hicInfo

Shows information about a Hi-C matrix file (no. of bins, bin length, sum, max, min, etc)

hicCompareMatrices

Computes difference or ratio between two matrices

hicAverageRegions

Computes the average of multiple given regions, usually TAD regions

hicPlotAverageRegions

visualization of hicAverageRegions

hicNormalize

Normalizes the given matrices to 0-1 range or the smallest read coverage

hicConvertFormat

Converts between different Hi-C interaction matrices

hicAdjustMatrix

Keeps, removes or masks regions in a Hi-C matrix

hicValidateLocations

Compare the loops with known peak protein locations

hicMergeLoops

Merges loops of different resolutions

hicCompartmentalization

Compute the global compartmentalization signal

chicQualityControl

Quality control for cHi-C data

chicViewpointBackgroundModel

Background model computation for cHi-C analysis

chicViewpoint

Computation of all viewpoints based on background model for cHi-C analysis

chicSignificantInteractions

Detection of significant interactions per viewpoint based on background model

chicAggregateStatistic

Compiling of target regions for two samples as input for differential analysis

chicDifferentialTest

Differential analysis of interactions of two samples

chicPlotViewpoint

Plotting of viewpoint with background model and highlighting of significant and differential regions

hicPlotSVL

Computing short vs long range contacts and plotting the results

hicHyperoptDetectLoops

Search for optimal hicDectectLoops parameters

hicHyperoptDetectLoopsHiCCUPS

Search for optimal Juicer HiCCUPS parameters

hicInterIntraTAD

Extras and computes inter and intra TAD data

hicTrainTADClassifier

Train a classifier to predict TADs with hicTADClassifier

hicTADClassifier

Predict TAD boundaries with hicTADClassifier

Getting Help

  • For all kind of questions, suggesting changes/enhancements and to report bugs, please create an issue on our GitHub repository

  • In the past we offered to post on Biostars with Tag hicexplorer : Biostars or on the deepTools mailing list. We still check these resources from time to time but the preferred way to communicate are GitHub issues.

Contents:

Installation

Requirements

  • python >= 3.6

  • numpy >= 1.19.*

  • scipy >= 1.5.*

  • matplotlib-base >= 3.1.*

  • ipykernel >= 5.3.0

  • pysam >= 0.16.*

  • intervaltree >= 3.1.*

  • biopython

  • pytables >= 3.6.*

  • pandas >= 1.1.*

  • pybigwig >= 0.3.*

  • jinja2 >= 2.11

  • unidecode >= 1.1.*

  • hicmatrix >= 15

  • hic2cool >= 0.8.3

  • psutil >= 5.7.*

  • pygenometracks >= 3.5

  • fit_nbinom >= 1.1

  • cooler >= 0.8.10

  • krbalancing >= 0.0.5 (Needs the library eigen; openmp is recommended for linux users. No openmp support on macOS.)

  • pybedtools >= 0.8.*

  • future >= 0.18

  • tqdm >= 4.50

  • hyperopt >= 0.2.4

  • python-graphviz >= 0.14

Warning: Python 2.7 support is discontinued. Moreover, the support for pip is discontinued too. Warning: We strongly recommend to use the conda package manager and will no longer give support on all issues raising with pip.

Command line installation using conda

The fastet way to obtain Python 3.6 or 3.7 together with numpy and scipy is via the Anaconda Scientific Python Distribution. Just download the version that’s suitable for your operating system and follow the directions for its installation. All of the requirements for HiCExplorer can be installed in Anaconda with:

$ conda install hicexplorer -c bioconda -c conda-forge

We strongly recommended to use conda to install HiCExplorer.

Command line installation using pip

The installation via pip is discontinued with version 3.0. The reason for this is that we want to provide a ‘one-click’ installation. However, with version 3.0 we added the C++ library eigen as dependency and pip does not support non-Python packages.

For older versions you can still use pip: Install HiCExplorer using the following command:

$ pip install hicexplorer

All python requirements should be automatically installed.

If you need to specify a specific path for the installation of the tools, make use of pip install’s numerous options:

$ pip install --install-option="--prefix=/MyPath/Tools/hicexplorer" git+https://github.com/deeptools/HiCExplorer.git

Warning: It can be that you have to install additional packages via your system package manager to successfully install HiCExplorer via pip. Warning: We strongly recommend to use the conda package manager and will no longer give support on all issues raising with pip.

Command line installation without pip

You are highly recommended to use pip rather than these more complicated steps.

  1. Install the requirements listed above in the “requirements” section. This is done automatically by pip.

2. Download source code

$ git clone https://github.com/deeptools/HiCExplorer.git

or if you want a particular release, choose one from https://github.com/deeptools/HiCExplorer/releases:

$ wget https://github.com/deeptools/HiCExplorer/archive/1.5.12.tar.gz
$ tar -xzvf

3. To install the source code (if you don’t have root permission, you can set a specific folder using the --prefix option)

$ python setup.py install --prefix /User/Tools/hicexplorer

Galaxy installation

HiCExplorer can be easily integrated into a local Galaxy, the wrappers are provided at the Galaxy tool shed.

Installation with Docker

The HiCExplorer Galaxy instance is also available as a docker container, for those wishing to use the Galaxy framework but who also prefer a virtualized solution. This container is quite simple to install:

$ sudo docker pull quay.io/bgruening/galaxy-hicexplorer

To start and otherwise modify this container, please see the instructions on the docker-galaxy-stable github repository. Note that you must use bgruening/galaxy-hicexplorer in place of bgruening/galaxy-stable in the examples, as the HiCExplorer Galaxy container is built on top of the galaxy-stable container.

Tip

For support, or feature requests contact: deeptools@googlegroups.com

HiCExplorer tools

Tools for Hi-C data pre-processing

hicFindRestSites
Further usage

In case multiple restriction enzymes are used in one experiment, hicFindRestSite can be used to find restriction sites individually per enzyme. Afterwards, all output bed files should be combined. However, it should be noted that the QC report will not be correct for this specific usage.

hicBuildMatrix

Using an alignment from a program that supports local alignment (eg. Bowtie2) where both PE reads are mapped using the –local option, this program reads such file and creates a matrix of interactions.

usage: hicBuildMatrix --samFiles two sam files two sam files --outFileName
                      FILENAME --QCfolder FOLDER --restrictionCutFile BED file
                      [BED file ...] --restrictionSequence RESTRICTIONSEQUENCE
                      [RESTRICTIONSEQUENCE ...] --danglingSequence
                      DANGLINGSEQUENCE [DANGLINGSEQUENCE ...]
                      [--outBam bam file] [--binSize BINSIZE [BINSIZE ...]]
                      [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE]
                      [--maxLibraryInsertSize MAXLIBRARYINSERTSIZE]
                      [--genomeAssembly GENOMEASSEMBLY]
                      [--region CHR:START-END] [--keepSelfLigation]
                      [--keepSelfCircles]
                      [--minMappingQuality MINMAPPINGQUALITY]
                      [--threads THREADS] [--inputBufferSize INPUTBUFFERSIZE]
                      [--doTestRun] [--doTestRunLines DOTESTRUNLINES]
                      [--skipDuplicationCheck] [--chromosomeSizes txt file]
                      [--help] [--version]
Required arguments
--samFiles, -s

The two PE alignment sam files to process

--outFileName, -o

Output file name for the Hi-C matrix.

--QCfolder

Path of folder to save the quality control data for the matrix. The log files produced this way can be loaded into hicQC in order to compare the quality of multiple Hi-C libraries.

--restrictionCutFile, -rs

BED file(s) with all restriction cut sites (output of “hicFindRestSite” command). Should only contain the restriction sites of the same genome which has been used to generate the input sam files. Using regions of a different genome version can generate false results! To use more than one restriction enzyme, generate a restrictionCutFile for each enzyne and list them space seperated.

--restrictionSequence, -seq

Sequence of the restriction site, if multiple are used, please list them space seperated. If a dangling sequence is listed at the same time, please preserve the same order.

--danglingSequence

Sequence left by the restriction enzyme after cutting, if multiple are used, please list them space seperated and preserve the order. Each restriction enzyme recognizes a different DNA sequence and, after cutting, they leave behind a specific “sticky” end or dangling end sequence. For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT. For DpnII, the restriction site and dangling end sequence are the same: GATC. This information is easily found on the description of the restriction enzyme. The dangling sequence is used to classify and report reads whose 5’ end starts with such sequence as dangling-end reads. A significant portion of dangling-end reads in a sample are indicative of a problem with the re-ligation step of the protocol.

Optional arguments
--outBam, -b

Output bam file to process. Optional parameter. A bam file containing all valid Hi-C reads can be created using this option. This bam file could be useful to inspect the distribution of valid Hi-C reads pairs or for other downstream analyses, but is not used by any HiCExplorer tool. Computation will be significantly longer if this option is set.

--binSize, -bs

Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. If not given, matrices of restriction site resolution will be built. Optionally for mcool file format: Define multiple resolutions which are all a multiple of the first value. Example: –binSize 10000 20000 50000 will create a mcool file formate containing the three defined resolutions.

--minDistance

Minimum distance between restriction sites. Restriction sites that are closer than this distance are merged into one. This option only applies if –restrictionCutFile is given (Default: 300).

Default: 300

--maxDistance

This parameter is now obsolete. Use –maxLibraryInsertSize instead

--maxLibraryInsertSize

The maximum library insert size defines different cut offs based on the maximum expected library size. This is not the average fragment size but the higher end of the the fragment size distribution (obtained using for example a Fragment Analyzer or a Bioanalyzer) which usually is between 800 to 1500 bp. If this value is not known use the default value. The insert value is used to decide if two mates belong to the same fragment (by checking if they are within this max insert size) and to decide if a mate is too far away from the nearest restriction site (Default: 1000).

Default: 1000

--genomeAssembly, -ga

The genome the reads were mapped to. Used for metadata of cool file.

--region, -r

Region of the genome to limit the operation to. The format is chr:start-end. It is also possible to just specify a chromosome, for example –region chr10

--keepSelfLigation

If set, inward facing reads less than 1000 bp apart and having a restrictionsite in between are removed. Although this reads do not contribute to any distant contact, they are useful to account for bias in the data (for the moment is always True).

Default: False

--keepSelfCircles

If set, outward facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots.

Default: False

--minMappingQuality

minimum mapping quality for reads to be accepted. Because the restriction enzyme site could be located on top of the read, this may reduce the reported quality of the read. Thus, this parameter may be adjusted if too many low quality (but otherwise perfectly valid Hi-C reads) are found. A good strategy is to make a test run (using the –doTestRun), then checking the results to see if too many low quality reads are present and then using the bam file generated to check if those low quality reads are caused by the read not being mapped entirely (Default: 15).

Default: 15

--threads

Number of threads. Using the python multiprocessing module. One master process which is used to read the input file into the buffer and one process which is merging the output bam files of the processes into one output bam file. All other threads do the actual computation. Minimum value for the ‘–thread’ parameter is 2. The usage of 8 threads is optimal if you have an HDD. A higher number of threads is only useful if you have a fast SSD. Have in mind that the performance of hicBuildMatrix is influenced by the number of threads, the speed of your hard drive and the inputBufferSize. To clarify: the performance with a higher thread number is not negative influenced but not positive too. With a slow HDD and a high number of threads many threads will do nothing most of the time (Default: 4).

Default: 4

--inputBufferSize

Size of the input buffer of each thread. 400,000 read pairs per input file per thread is the default value. Reduce this value to decrease memory usage.

Default: 400000

--doTestRun

A test run is useful to test the quality of a Hi-C experiment quickly. It works by testing only 1,000,000 reads. This option is useful to get an idea of quality control values like inter-chromosomal interactions, duplication rates etc.

Default: False

--doTestRunLines

Number of lines to consider for the qc test run (Default: 1000000).

Default: 1000000

--skipDuplicationCheck

Identification of duplicated read pairs is memory consuming. Thus, in case of memory errors this check can be skipped. However, consider running a –doTestRun first to get an estimation of the duplicated reads.

Default: False

--chromosomeSizes, -cs

File with the chromosome sizes for your genome. A tab-delimited two column layout “chr_name size” is expectedUsually the sizes can be determined from the SAM/BAM input files, however, for cHi-C or scHi-C it can be that at the start or end no data is present. Please consider that this option causes that only reads are considered which are on the listed chromosomes.Use this option to guarantee fixed sizes. An example file is available via UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/dm3.chrom.sizes

--version

show program’s version number and exit

Please note that the file type extension for the output matrix (--outFileName) must be given! This can be .h5, .cool or the specializations of cool .mcool, if the path is given! For .scool files please create one .cool file per cell and merge it together with scHiCExplorer’s scHicMergeToSCool.

Building multicooler matrices

hicBuildMatrix supports building multicooler matrices which are for example needed for visualization with HiGlass. To do so, use as out file format either .cool or .mcool and define the desired resolutions as –binSize. hicBuildMatrix builds the interaction matrix for the highest resolution and merges the bins for the lower resolutions. The lower resolutions need to be an integer multiplicative of the highest resolution.

$ hicBuildMatrix -s forward.bam reverse.bam -o multi_resolution.cool
  --binSize 10000 20000 50000 100000 --QCfolder QC

Introducing with version 3.5 we support multiple restriction and dangling end sequences, and multiple restriction cut site files. Hi-C protocols that use multiple restriction cut enzymes benefit from this and get now an improved QC report. Version 3.5 adds also the support for a chromosome size file which can help to get interaction matrices with a predefined size. Capture Hi-C or single-cell Hi-C data, where it is not guaranteed that reads from all areas of the chromosome are present benefit from this latest improvement.

Missing scaffolds or contigs in the Hi-C matrix

Restriction enzymes cut the DNA at their specific restriction cut site sequences. It can occur for scaffolds or contigs (less likely it happens for chromosomes) that it does not contain any of these cut sites. In this case, the reads from this scaffold or contig are considered as invalid and are part of the ‘same restriction fragment’-statistics in the QC report.

hicSumMatrices

Adds Hi-C matrices of the same size. Format has to be hdf5 (.h5) or npz. In order to minimize the the loss of information, it is recommended to to sum uncorrected matrices (before hicCorrectMatrix).

usage: hicSumMatrices --matrices .h5 or cooler file format
                      [.h5 or cooler file format ...] --outFileName
                      OUTFILENAME [-h] [--version]
Required arguments
--matrices, -m

Space-delimited names of the matrices to add. The matrices must have the same shape/size. You can verify their size by using hicInfo.

--outFileName, -o

File name to save the resulting matrix. The output is from the same file type as the input. Please add the file ending suffix (either .h5 or .cool), if it is not given, there will be no output.

Optional arguments
--version

show program’s version number and exit

hicMergeMatrixBins
Background

Depending on the downstream analyses to perform on a Hi-C matrix generated with HiCExplorer, one might need different bin resolutions. For example using hicPlotMatrix to display chromatin interactions of a whole chromosome will not produce any meaningful visualization if it is performed on a matrix at restriction sites resolution. hicMergeMatrixBins address this issue by merging a given number of adjacent bins (determined by --numBins). To minimize the loss of information (matrix bins with few counts), it is recommended to perform hicMergeMatrixBins on uncorrected matrices (before matrix correction). The matrix correction can be performed after bin merging for downstream analyses using hicCorrectMatrix.

Description

Merges bins from a Hi-C matrix. For example, using a matrix containing 5kb bins, a matrix of 50kb bins can be derived using –numBins 10. From one type of downstream analysis to another, different bin sizes are used. For example to call TADs, unmerged matrices are recommended while to display Hi-C matrices, bins of approximately 2000bp usually yield the best representations with hicPlotMatrix for small regions, and even larger bins (50kb) are recommended for whole chromosome representations or for hicPlotDistVsCounts.

usage: hicMergeMatrixBins --matrix matrix.h5 --outFileName OUTFILENAME
                          --numBins int [--runningWindow] [--help] [--version]
Required arguments
--matrix, -m

Matrix to reduce in h5 format.

--outFileName, -o

File name to save the resulting matrix. The output is also a .h5 file. But don’t add the suffix.

--numBins, -nb

Number of bins to merge.

Optional arguments
--runningWindow

Set to merge for using a running window of length –numBins.

Default: False

--version

show program’s version number and exit

Usage example
Running hicMergeMatrixBins

Below, we display the example of plotting a Hi-C matrix at the scale of the whole X-chromosome and at the scale of a 1Mb region of the X chromosome. To do this, we will perform two different bin merging using hicMergeMatrixBins on an uncorrected matrix built at restriction site resolution using hicBuildMatrix. To do this, we run the two following commands:

$ hicMergeMatrixBins -m myMatrix.h5 -o myMatrix_merged_nb3.h5 -nb 3

$ hicMergeMatrixBins -m myMatrix.h5 -o myMatrix_merged_nb50.h5 -nb 50

Starting from a matrix myMatrix.h5 with bins of a median length of 529bp, the first command produces a matrix myMatrix_merged_nb3.h5 with bins of a median length of 1661bp while the second command produces a matrix myMatrix_merged_nb50.h5 with bins of a median length of 29798bp. This can be checked with hicInfo.

After correction of these three matrices using hicCorrectMatrix, we can now plot the corrected matrices myMatrix_corrected.h5, myMatrix_merged_nb3_corrected.h5 and myMatrix_merged_nb50_corrected.h5 at the scale of the whole X-chromosome and at the X:2000000-3000000 region to see the effect of bin merging on the interactions visualization.

Effect of bin merging at the scale of a chromosome
$ hicPlotMatrix -m myMatrix_corrected.h5 \
-o myMatrix_corrected_Xchr.png \
--chromosomeOrder X \
-t Restriction_sites_resolution --log1p \
--clearMaskedBins

$ hicPlotMatrix -m myMatrix_merged_nb3_corrected.h5 \
-o myMatrix_merged_nb3_corrected_Xchr.png \
--chromosomeOrder X \
-t Bins_merged_by_3 --log1p \
--clearMaskedBins

 $ hicPlotMatrix -m myMatrix_merged_nb50_corrected.h5 \
-o myMatrix_merged_nb50_corrected_Xchr.png \
--chromosomeOrder X \
-t Bins_merged_by_50 --log1p \
--clearMaskedBins

When observed altogether, the plots produced by these three commands show that merging of bins by 50 is the most adequate way to plot interactions for a whole chromosome in Drosophila melanogaster when starting from a matrix with bins of a median length of 529bp.

_images/hicMergeMatrixBins_Xchr.png
Effect of bin merging at the scale of a specific region
 $ hicPlotMatrix -m myMatrix_corrected.h5 \
-o myMatrix_corrected_Xregion.png \
--region X:2000000-3000000 \
-t Restriction_sites_resolution --log1p \
--clearMaskedBins

$ hicPlotMatrix -m myMatrix_merged_nb3_corrected.h5 \
-o myMatrix_merged_nb3_corrected_Xregion.png \
--region X:2000000-3000000 \
-t Bins_merged_by_3 --log1p \
--clearMaskedBins

 $ hicPlotMatrix -m myMatrix_merged_nb50_corrected.h5 \
-o myMatrix_merged_nb50_corrected_Xregion.png \
--region X:2000000-3000000 \
-t Bins_merged_by_50 --log1p \
--clearMaskedBins

When observed altogether, the plots produced by these three commands show that merging of bins by 3 is the most adequate way to plot interactions for a region of 1Mb in Drosophila melanogaster when starting from a matrix with bins of a median length of 529bp.

_images/hicMergeMatrixBins_Xregion.png
hicCorrectMatrix

This function provides 2 balancing methods which can be applied on a raw matrix.

I. KR: It balances a matrix using a fast balancing algorithm introduced by Knight and Ruiz (2012).

II. ICE: Iterative correction of a Hi-C matrix (see Imakaev et al. 2012 Nature Methods for details). For this method to work correctly, bins with zero reads assigned to them should be removed as they cannot be corrected. Also, bins with low number of reads should be removed, otherwise, during the correction step, the counts associated with those bins will be amplified (usually, zero and low coverage bins tend to contain repetitive regions). Bins with extremely high number of reads can also be removed from the correction as they may represent copy number variations.

To aid in the identification of bins with low and high read coverage, the histogram of the number of reads can be plotted together with the Median Absolute Deviation (MAD).

It is recommended to run hicCorrectMatrix as follows:

$ hicCorrectMatrix diagnostic_plot –matrix hic_matrix.h5 -o plot_file.png

Then, after revising the plot and deciding on the threshold values:

$ hicCorrectMatrix correct –correctionMethod ICE –matrix hic_matrix.h5 –filterThreshold <lower threshold> <upper threshold> -o corrected_matrix

For a more in-depth review of how to determine the threshold values, please visit: http://hicexplorer.readthedocs.io/en/latest/content/example_usage.html#correction-of-hi-c-matrix

We recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

usage: hicCorrectMatrix [-h] [--version]  ...
Named Arguments
--version

show program’s version number and exit

Options

Possible choices: diagnostic_plot, correct

To get detailed help on each of the options:

$ hicCorrectMatrix diagnostic_plot -h

$ hicCorrectMatrix correct -h

Sub-commands:
diagnostic_plot
Plots a histogram of the coverage per bin together with the

modified z-score based on the median absolute deviation method

(see Boris Iglewicz and David Hoaglin 1993, Volume 16: How to Detect and Handle Outliers The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor).

hicCorrectMatrix diagnostic_plot --matrix hic_matrix.h5 -o file.png
Required arguments
--matrix, -m

Name of the Hi-C matrix to correct in .h5 format.

--plotName, -o

File name to save the diagnostic plot.

Optional arguments
--chromosomes

List of chromosomes to be included in the iterative correction. The order of the given chromosomes will be then kept for the resulting corrected matrix.

--xMax

Max value for the x-axis in counts per bin.

--perchr

Compute histogram per chromosome. For samples from cells with uneven number of chromosomes and/or translocations it is advisable to check the histograms per chromosome to find the most conservative filterThreshold.

Default: False

--verbose

Print processing status.

Default: False

correct

Run Knight-Ruiz matrix balancing algorithm (KR) or the iterative matrix correction (ICE) .

hicCorrectMatrix correct --matrix hic_matrix.h5 --filterThreshold -1.2 5 (Only if ICE)-out corrected_matrix.h5
Required arguments
--matrix, -m

Name of the Hi-C matrix to correct in .h5 format.

--outFileName, -o

File name to save the resulting matrix. The output is a .h5 file.

Optional arguments
--correctionMethod

Possible choices: KR, ICE

Method to be used for matrix correction. It can be set to KR or ICE (Default: “KR”).

Default: “KR”

--filterThreshold, -t

Removes bins of low or large coverage. Usually these bins do not contain valid Hi-C data or represent regions that accumulate reads and thus must be discarded. Use hicCorrectMatrix diagnostic_plot to identify the modified z-value thresholds. A lower and upper threshold are required separated by space, e.g. –filterThreshold -1.5 5. Applied only for ICE!

--iterNum, -n

Number of iterations to compute.only for ICE! (Default: 500).

Default: 500

--inflationCutoff

Value corresponding to the maximum number of times a bin can be scaled up during the iterative correction. For example, an inflation cutoff of 3 will filter out all bins that were expanded 3 times or more during the iterative correctionself.Only for ICE!

--transCutoff, -transcut

Clip high counts in the top -transcut trans regions (i.e. between chromosomes). A usual value is 0.05. Only for ICE!

--sequencedCountCutoff

Each bin receives a value indicating the fraction that is covered by reads. A cutoff of 0.5 will discard all those bins that have less than half of the bin covered. Only for ICE!

--chromosomes

List of chromosomes to be included in the iterative correction. The order of the given chromosomes will be then kept for the resulting corrected matrix

--skipDiagonal, -s

If set, diagonal counts are not included. Only for ICE!

Default: False

--perchr

Normalize each chromosome separately. This is useful for samples from cells with uneven number of chromosomes and/or translocations.

Default: False

--verbose

Print processing status.

Default: False

--version

show program’s version number and exit

With HiCExplorer version 3.0 we offer an additional Hi-C interaction matrix correction algorithm: Knight-Ruiz.

$ hicCorrectMatrix correct --matrix matrix.cool --correctionMethod KR --chromosomes chrUextra chr3LHet --outFileName corrected_KR.cool

The iterative correction can be used via:

$ hicCorrectMatrix correct --matrix matrix.cool --correctionMethod ICE --chromosomes chrUextra chr3LHet --iterNum 500  --outFileName corrected_ICE.cool --filterThreshold -1.5 5.0

HiCExplorer version 3.1 changes the way data is transferred from Python to C++ for the KR correction algorithm. With these changes the following runtime and peak memory usage on Rao 2014 GM12878 primary + replicate data is possible:

  • KR on 25kb: 165 GB, 1:08 h

  • ICE on 25kb: 224 GB, 3:10 h

  • KR on 10kb: 228 GB, 1:42 h

  • ICE on 10kb: 323 GB, 4:51 h

  • KR on 1kb: 454 GB, 16:50 h

  • ICE on 1kb: >600 GB, > 2.5 d (we interrupted the computation and strongly recommend to use KR on this resolution)

For HiCExplorer versions <= 3.0 KR performs as follows:

  • KR on 25kb: 159 GB, 57:11 min

  • KR on 10kb: >980 GB, – (out of memory on 1TB node, we do not have access to a node with more memory on our cluster)

hicNormalize

Normalizes given matrices either to the smallest given read number of all matrices or to 0 - 1 range. However, it does NOT compute the contact probability.

We recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

usage: hicNormalize --matrices MATRICES [MATRICES ...] --normalize
                    {norm_range,smallest,multiplicative} --outFileName
                    FILENAME [FILENAME ...]
                    [--multiplicativeValue MULTIPLICATIVEVALUE]
                    [--setToZeroThreshold SETTOZEROTHRESHOLD] [--help]
                    [--version]
Required arguments
--matrices, -m

The matrix (or multiple matrices) to get information about. HiCExplorer supports the following file formats: h5 (native HiCExplorer format) and cool.

--normalize, -n

Possible choices: norm_range, smallest, multiplicative

Normalize to a) 0 to 1 range, b) all matrices to the lowest read count of the given matrices (Default: “smallest”).

Default: “smallest”

--outFileName, -o

Output file name for the Hi-C matrix.

Optional arguments
--multiplicativeValue, -mv

Value to multiply if –normalize is set to multiplicative. (Default: 1).

Default: 1

--setToZeroThreshold, -sz

A threshold to set all values after normalization to 0 if smaller this threshold. Default value is 0 i.e. there is no effect.It is recommended to set it for the normalize mode “smallest” to 1.0. This parameter will influence the sparsity of the matrix by removing many values close to 0 in smallest normalization mode.

Default: 0.0

--version

show program’s version number and exit

Background

To be able to compare different Hi-C interaction matrices the matrices need to be normalized to a equal level of read coverage or value ranges. hicNormalize accomplish this by offering two modes: 0-1 range normalization or a read count normalization.

Usage example
Normalize to 0-1 range

norm_range mode of hicNormalize normalizes all reads to the 0 to 1 range i.e. the maximum value of the interaction matrix becomes 1 and the minimum value 0.

$ hicNormalize -m matrix.cool --normalize norm_range -o matrix_0_1_range.cool
Normalize to smallest read count

All matrices are normalized in the way the total read count of each matrix is equal to the read count of the matrix with the smallest read count of all input matrices.

Example
  • matrix.cool with a read count of 10000

  • matrix2.cool with a read count of 12010

  • matrix3.cool with a read count of 11000

In this example each entry in matrix2.cool and matrix3.cool are normalized with a factor of 12010 / 10000 respective with 11000 / 10000.

$ hicNormalize -m matrix.cool matrix2.cool matrix3.cool --normalize smallest
  -o matrix_normalized.cool matrix2_normalized.cool matrix3_normalized.cool

Tools for Hi-C QC

hicQuickQC
Background

This tool considers the first 1,000,000 reads (or user defined number) of the mapped bam files to get a quality estimate of the Hi-C data.

Description

The tool hicQuickQC considers the first n lines of two bam/sam files to get a first estimate of the quality of the data. It is highly recommended to set the restriction enzyme and dangling end parameter to get a good quality report.

usage: hicQuickQC --samFiles two sam files two sam files --QCfolder FOLDER
                  --restrictionCutFile BED file [BED file ...]
                  --restrictionSequence RESTRICTIONSEQUENCE
                  [RESTRICTIONSEQUENCE ...] --danglingSequence
                  DANGLINGSEQUENCE [DANGLINGSEQUENCE ...] [--lines LINES]
                  [--help] [--version]
Required arguments
--samFiles, -s

The two PE alignment sam files to process.

--QCfolder

Path of folder to save the quality control data of the matrix. The log files produced this way can be loaded into hicQC in order to compare the quality of multiple Hi-C libraries.

--restrictionCutFile, -rs

BED file(s) with all restriction cut places (output of “findRestSite” command). Should contain only mappable restriction sites. If given, the bins are set to match the restriction fragments (i.e. the region between one restriction site and the next). Alternatively, a fixed binSize can be defined instead. However, either binSize or restrictionCutFile must be defined. To use more than one restriction enzyme, generate for each one a restrictionCutFile and list them space seperated.

--restrictionSequence, -seq

Sequence of the restriction site, if multiple are used, please list them space seperated. If a dangling sequence is listed at the same time, please preserve the same order.

--danglingSequence

Sequence left by the restriction enzyme after cutting, if multiple are used, please list them space seperated and preserve the order. Each restriction enzyme recognizes a different DNA sequence and, after cutting, they leave behind a specific “sticky” end or dangling end sequence. For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT. For DpnII, the restriction site and dangling end sequence are the same: GATC. This information is easily found on the description of the restriction enzyme. The dangling sequence is used to classify and report reads whose 5’ end starts with such sequence as dangling-end reads. A significant portion of dangling-end reads in a sample are indicative of a problem with the re-ligation step of the protocol.

Optional arguments
--lines

Number of lines to consider for the QC test run (Default: 1000000).

Default: 1000000

--version

show program’s version number and exit

For more information see encode .

hicQC
Background

This tool can process quality control log files produced by hicBuildMatrix for multiple samples at once to generate summary tables and plots of quality control (QC) measures for all these samples. Additionally, an HTML output is generated where all summary tables and plots are displayed.

Description

Tabulates and plots QC measures from hicBuildMatrix log files within an HTML output

usage: hicQC --logfiles matrix1_QCfolder/QC.log matrix2_QCfolder/QC.log --labels "sample 1" "sample 2" --outputFolder QC_all_samples)
Required arguments
--logfiles, -l

Path to the log files to be processed

--labels

Label to assign to each log file. Each label should be separated by a space. Quote labels that contain spaces: E.g. –labels label1 “labels 2”

--outputFolder, -o

Several files with be saved under this folder: A table containing the results and a html file with several images.

Optional arguments
--dpi

Image resolution. By default high resolution png images with a 200 dpi are created.

Default: 200

--version

show program’s version number and exit

Usage example

hicBuildMatrix generates a QC.log file for each processed Hi-C sample in a folder specified in the --QCfolder argument. The quality control measures stored for each of these samples can be merged in summary tables and plots using hicQC. An example usage is:

$ hicQC --logfiles ./sample_1/QC.log ./sample_2/QC.log /sample_3/QC.log \
--labels "Sample 1" "Sample 2" "Sample 3" \
-o QC_plots

This command will generate several plots that are described in details below.

_images/pairs_sequenced.png

On the plot above, we can see how many reads were sequenced per sample (Sequenced reads), how many reads were mappable, unique and of high quality and how many reads passed all quality controls and are thus useful for further analysis (Hi-C contacts). All quality controls used for read filtering are explained below.

_images/unmappable_and_non_unique.png

The figure above contains the fraction of reads with respect to the total number of reads that did not map, that have a low quality score or that didn’t map uniquely to the genome. In our example we can see that Sample 3 has the highest fraction of Hi-C contacts. We explain the differences between the three samples on the plot below.

_images/pairs_discarded.png

This figure contains the fraction of read pairs (with respect to mappable and unique reads) that were discarded when building the Hi-C matrix. You can find the description of each category below:

Dangling ends: reads that start with the restriction site and constitute reads that were digested but not ligated. Sample 1 in our example has a high fraction of dangling ends (and thus a low proportion of Hi-C contacts). Reasons for this can be inefficient ligation or insufficient removal of danging ends during samples preparation.

Duplicated pairs: reads that have the same sequence due to PCR amplification. For example, Sample 2 was amplified too much and thus has a very high fraction of duplicated pairs.

Same fragment: read mates facing inward, separated by up to 800bp that do not have a restriction enzyme site in between. These read pairs are not valid Hi-C pairs and are thus discarded from further analyses.

Self circle: read pairs within 25kb with ‘outward’ read orientation.

Self ligation: read pairs with a restriction site in between that are within 800bp.

_images/distance.png

The figure above contains the fraction of read pairs (with respect to mappable reads) that compose inter chromosomal, intra short range (< 20kb) or intra long range (>= 20kb) contacts. Inter chromosomal reads of a wild-type sample are expected to be low. Trans-chromosomal contacts can be primarily considered as random ligation events. These would be expected to contribute to technical noise that may obscure some of the finer features in the Hi-C datasets (Nagano et al. 2015, Comparison of Hi-C results using in-solution versus in-nucleus ligation, doi: https://doi.org/10.1186/s13059-015-0753-7). As such, a high fraction of inter chromosomal reads is an indicator of low sample quality, but it can also be associated to cell cycle changes (Nagano et al. 2018, Cell-cycle dynamics of chromosomal organization at single-cell resolution, doi: https://doi.org/10.1038/nature23001).

Short range and long range contacts proportions can be associated to how the fixation is performed during Hi-C sample preparation. These two proportions also directly impact the Hi-C corrected counts versus genomic distance plots generated by hicPlotDistVsCounts.

_images/read_orientation.png

The last figure shows the fractions of the read pair types: inward, outward, left or right read pairs (with respect to mappable reads). Deviations from an equal distribution indicates problems during sample preparation.

hicCorrelate
Background

hicCorrelate is a dedicated Quality Control tool that allows the correlation of multiple Hi-C matrices at once with either a heatmap or scatter plots output.

Description

Computes pairwise correlations between Hi-C matrices data. The correlation is computed taking the values from each pair of matrices and discarding values that are zero in both matrices.Parameters that strongly affect correlations are bin size of the Hi-C matrices and the considered range. The smaller the bin size of the matrices, the finer differences you score. The –range parameter should be selected at a meaningful genomic scale according to, for example, the mean size of the TADs in the organism you work with.

usage: hicCorrelate --matrices MATRICES [MATRICES ...] [--zMin ZMIN]
                    [--zMax ZMAX] [--colorMap] [--plotNumbers]
                    [--method {pearson,spearman}] [--log1p]
                    [--labels sample1 sample2 [sample1 sample2 ...]]
                    [--range RANGE] --outFileNameHeatmap OUTFILENAMEHEATMAP
                    --outFileNameScatter OUTFILENAMESCATTER
                    [--chromosomes CHROMOSOMES [CHROMOSOMES ...]] [--help]
                    [--version]
Required arguments
--matrices, -m

Matrices to correlate (usually .h5 but other formats are allowed). hicCorrelate is better used on un-corrected matrices in order to exclude any changes introduced by the correction.

Heatmap arguments

Options for generating the correlation heatmap

--zMin, -min

Minimum value for the heatmap intensities. If not specified the value is set automatically.

--zMax, -max

Maximum value for the heatmap intensities.If not specified the value is set automatically.

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html (Default: “jet”).

Default: “jet”

--plotNumbers

If set, then the correlation number is plotted on top of the heatmap.

Default: False

Optional arguments
--method

Possible choices: pearson, spearman

Correlation method to use (Default: “pearson”).

Default: “pearson”

--log1p

If set, then the log1p of the matrix values is used. This parameter has no effect for Spearman correlations but changes the output of Pearson correlation and, for the scatter plot, if set, the visualization of the values is easier.

Default: False

--labels, -l

User defined labels instead of default labels from file names. Multiple labels have to be separated by space, e.g. –labels sample1 sample2 sample3

--range

In bp with the format low_range:high_range, for example 1000000:2000000. If –range is given only counts within this range are considered. The range should be adjusted to the size of interacting domains in the genome you are working with.

--outFileNameHeatmap, -oh

File name to save the resulting heatmap plot. Supported file formats are given by matplotlib, usually these are: png, pdf, ps, eps and svg.

Default: “heatmap.png”

--outFileNameScatter, -os

File name to save the resulting scatter plot. Supported file formats are given by matplotlib, usually these are: png, pdf, ps, eps and svg.

Default: “scatter.png”

--chromosomes

List of chromosomes to be included in the correlation.

--version

show program’s version number and exit

Usage example

Below, you can find a correlation example of uncorrected Hi-C matrices obtained from Drosophila melanogaster embryos, either wild-type or having one gene knocked-down by RNAi.

$ hicCorrelate -m Dmel_wt_1.h5 Dmel_wt_2.h5 Dmel_kd_1.h5 Dmel_kd_2.h5 \
--method=pearson --log1p \
--labels Dmel_wt_1 Dmel_wt_2 Dmel_kd_1 Dmel_kd_2 \
--range 5000:200000 \
--outFileNameHeatmap Dmel_heatmap --outFileNameScatter Dmel_scatterplot \
--plotFileFormat png
Heatmap _images/Dmel_heatmap.png

This example is showing a heatmap that was calculated using the Pearson correlation of un-corrected Hi-C matrices with a bin size of 6000 bp. The dendrogram indicates which samples are most similar to each other. You can see that the wild-type samples are separated from the knock-down samples. The second option we offer is calculating the Spearman correlation.

Scatter plot

Additionally, pairwise scatter plots comparing interactions between each sample can be plotted.

_images/Dmel_scatterplot.png
hicPlotDistVsCounts
Background

This tool allows a quick comparison between multiple Hi-C matrices of the Hi-C counts enrichment at different genomic ranges / distances up to whole chromosomes. Biological replicates should display the exact same distribution while samples coming from different cell-lines, treated versus untreated samples or mutant versus wild-type samples can display a different distribution at long and/or close ranges.

The results of this tool usually reflect the proportion of long-range and short-range contacts calculated in each sample by hicQC. Local TAD or contact enrichments will not impact the results computed by this tool, hicPCA is better suited for that purpose.

Description

This program creates distance vs. Hi-C counts plots. It can use several matrix files to compare them at once. If the –perchr option is given, each chromosome is plotted independently. When plotting multiple matrices, denser matrices are scaled down to match the sum of the smallest matrix.

usage: hicPlotDistVsCounts --matrices MATRICES [MATRICES ...] --plotFile file
                           name [--labels LABELS [LABELS ...]]
                           [--skipDiagonal] [--maxdepth INT bp] [--perchr]
                           [--chromosomeExclude CHROMOSOMEEXCLUDE [CHROMOSOMEEXCLUDE ...]]
                           [--domains DOMAINS] [--outFileData OUTFILEDATA]
                           [--plotsize PLOTSIZE PLOTSIZE] [--help] [--version]
Required arguments
--matrices, -m

Hi-C normalized (corrected) matrices. Each path should be separated by a space.

--plotFile, -o

File name to save the file. The given file ending will be used to determine the image format. The available options are: .png, .emf, .eps, .pdf and .svg.

Optional arguments
--labels

Label to assign to each matrix file. Each label should be separated by a space. Quote labels that contain spaces: E.g. –labels label1 “labels 2”. If no labels are given then the file name is used.

--skipDiagonal, -s

If set, diagonal counts are not included.

Default: False

--maxdepth

Maximum distance from diagonal to use. In other words, distances up to maxDepth are computed. Default is 3 million bp.

Default: 3000000

--perchr

If given, computes and display distance versus Hi-C counts plots for each chromosome stored in the matrices passed to –matrices.

Default: False

--chromosomeExclude

Exclude the given list of chromosomes. This is useful for example to exclude the Y chromosome. The names of the chromosomes should be separated by space.

--domains

Bed file with domains coordinates: instead of evaluating the distance vs. Hi-C counts for intra chromosomal counts, compute it for intra-domains.

--outFileData

If given, the data underlying the plots is saved on this file.

--plotsize

Width and height of the plot (in inches). Default is 6*number of cols, 4 * number of rows. The maximum number of rows is 4. Example: –plotsize 6 5

--version

show program’s version number and exit

Usage example

hicPlotDistVsCounts should be used on corrected matrices with large bins (e.g. at least 50kb bins), otherwise the curves will be very spiky and unstable at longer ranges because of the sparseness of the contacts. The tool hicMergeMatrixBins can be used to merge matrix bins and the tool hicCorrectMatrix can be used for matrix correction before using hicPlotDistVsCounts.

hicPlotDistVsCounts -m \
condition1_sample1_50_bins_merged.h5 \
condition1_sampel2_50_bins_merged.h5 \
condition2_sample1_50_bins_merged.h5 \
condition2_sample2_50_bins_merged.h5 \
-o counts_vs_dist_50_bins_merged.png \
--labels 'Cond 1 Sample 1' 'Cond 1 Sample 2' 'Cond 2 Sample 1' 'Cond 2 Sample 2' \
--maxdepth 20000000 \
--plotsize 5 4.2
_images/counts_vs_dist_50_bins_merged.png

Here, we see that the samples of the first condition are not so well correlated, but they follow the same tendencies and are distinct from the two samples of the second condition. The later are well correlated and display enriched long-range contacts compared to the samples of the first condition.

hicInfo

Prints information about a matrix or matrices including matrix size, number of elements, sum of elements, etc. An example usage is: $ hicInfo -m matrix1.h5 matrix2.h5 matrix3.h5

usage: hicInfo --matrices MATRICES [MATRICES ...] [--outFileName OUTFILENAME]
               [--no_metadata] [--help] [--version]
Required arguments
--matrices, -m

The matrix (or multiple matrices) to get information about. HiCExplorer supports the following file formats: h5 (native HiCExplorer format) and cool.

Optional arguments
--outFileName, -o

File name to save information of the matrix instead of writing it to the bash.

--no_metadata, -nm

Do not use meta data from cooler file to display information. This method is slower and was the default until version 2.2 of HiCExplorer. H5 files always use this parameter.

Default: True

--version

show program’s version number and exit

An example output looks like this:

File:        hic_matrix.h5
WARNING: bin size is not homogeneous. Median 434
Size:        339,684
Sum: 243,779,407
Bin length:  434
Chromosomes: 2L, 2R, 3L, 3R, 4, X
Non-zero elements:   229,122,848
Minimum:     0
Maximum:     2116
NaN bins:    0

With HiCExplorer version 3 we support metadata in cooler files. The default behavior is to use the metadata:

$ hicInfo -m matrix.cool

To use the old method (and default for h5) please add the parameter:

$ hicInfo -m matrix.cool --no_metadata

Tools for Hi-C data analysis

hicCompareMatrices
Background

This tool is useful to compare two matrices in .h5 format by applying operations like difference, ratio or log2ratio after normalization. This can be used to determine the effect of a mutation compared to wild-type samples on contact enrichment, or to see TAD structure modifications near differentially expressed genes between two conditions when followed by hicPlotMatrix. It can also be used to compare two biological replicates.

Description

Takes two matrices as input, normalizes them and applies the given operation. To normalize the matrices each element is divided by the sum of the matrix.

usage: hicCompareMatrices --matrices matrix.h5 matrix.h5 --outFileName
                          OUTFILENAME [--operation {diff,ratio,log2ratio}]
                          [--noNorm] [--help] [--version]
Required arguments
--matrices, -m

Name of the matrices in .h5 format to use, separated by a space.

--outFileName, -o

File name to save the resulting matrix. The output is also a .h5 file.

Optional arguments
--operation

Possible choices: diff, ratio, log2ratio

Operation to apply to the matrices (Default: “log2ratio”).

Default: “log2ratio”

--noNorm

Do not apply normalisation before computing the operation (Default: False).

Default: False

--version

show program’s version number and exit

Usage example

hicCompareMatrices is usually perfomed on corrected matrices (hicCorrectMatrix) with bins merged (hicMergeMatrixBins) depending on the downstream analyses to perform. Here is an example of a log2ratio comparison between M1BP Knockdown and GST cells in Drosophila melanogaster on corrected matrices with 50 bins merged (about 30kb bins).

hicCompareMatrices -m \
M1BP_KD_merge_m50_corrected.h5 \
GST_merge_rf_m50_corrected.h5 \
--operation log2ratio -o m1bp_over_gst_log2_m50.h5

This code outputs a matrix containing the normalized log2ratio values of M1BP_KD_merge_m50_corrected.h5 over GST_merge_rf_m50_corrected.h5. We can then display this matrix using hicPlotMatrix.

hicPlotMatrix -m \
m1bp_over_gst_log2_m50.h5 \
--clearMaskedBins \
--region chr2L:12,000,000-19,000,000 \
--vMin -4 --vMax 4 \
-o m1bp_over_gst_log2_m50_matrix_plot.png
_images/hicCompareMatrices_m1bp_over_gst_log2_m50_matrix_plot.png

In this plot we see that the cells with a M1BP Knockdown display a negative log2ratio compared to the wild-type. Depletion of M1BP thus show a dramatic effect on the distribution of Hi-C contacts in which short range contacts decrease (Ramirez et al. 2017, High-resolution TADs reveal DNA sequences underlying genome organization in flies, https://doi.org/10.1038/s41467-017-02525-w).

Below you can find an example of a log2ratio plot between Hi-C matrices of two biological replicates, no differences are observable which means that the replicates are well correlated.

_images/hicCompareMatrices_QC_log2_m50_matrix_plot.png
hicPCA

Computes PCA eigenvectors for a Hi-C matrix.

$ hicPCA –matrix hic_matrix.h5 -o pca1.bedgraph pca2.bedgraph

usage: hicPCA --matrix MATRIX --outputFileName OUTPUTFILENAME
              [OUTPUTFILENAME ...]
              [--whichEigenvectors WHICHEIGENVECTORS [WHICHEIGENVECTORS ...]]
              [--format {bedgraph,bigwig}]
              [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
              [--method {dist_norm,lieberman}] [--ligation_factor]
              [--extraTrack EXTRATRACK] [--histonMarkType HISTONMARKTYPE]
              [--pearsonMatrix PEARSONMATRIX] [--obsexpMatrix OBSEXPMATRIX]
              [--ignoreMaskedBins] [--help] [--version]
Required arguments
--matrix, -m

HiCExplorer matrix in h5 format.

--outputFileName, -o

File names for the result of the pca.Number of output files must match the number of computed eigenvectors.

Optional arguments
--whichEigenvectors, -we

The list of eigenvectors that the PCA should compute e.g. 1 2 5 will return the first, second and fifth eigenvector. (Default: “1 2”).

Default: “1 2”

--format, -f

Possible choices: bedgraph, bigwig

Output format. Either bedgraph or bigwig (Default: “bigwig”).

Default: “bigwig”

--chromosomes

List of chromosomes to be included in the correlation.

--method

Possible choices: dist_norm, lieberman

possible methods which can be used to build the obs-exp matrix are dist_norm and lieberman (Default: “dist_norm”).

Default: “dist_norm”

--ligation_factor

Setting this flag multiplies a scaling factor to each entry of the expected matrix to take care of the proximity ligation as has been explained in Homer software. This flag is only affective with dist_norm method and will be ignored if lieberman method is chosen.

Default: False

--extraTrack

Either a gene track or a histone mark coverage file (preferably a broad mark) is needed to decide if the values of the eigenvector need a sign flip or not. Please consider: bed files are interpreted as gene tracks and bigwig files as histone marks.

--histonMarkType

Set it to active or inactive. This is only necessary if a histon mark coverage file is given as an extraTrack (Default: “active”).

Default: “active”

--pearsonMatrix, -pm

Internally the input matrix is converted per chromosome to obs_exp matrix and consecutively to a Pearson matrix. Set this parameter to write the pearson matrix to a file.

--obsexpMatrix, -oem

Internally the input matrix is converted per chromosome to obs_exp matrix and consecutively to a Pearson matrix. Set this parameter to write the observe/expected matrix to a file.

--ignoreMaskedBins

Mask bins are usually set to 0. This option removes the masked bins before the PCA is computed. Attention: this will lead to empty PCA regions.

Default: False

--version

show program’s version number and exit

hicTransform

Converts the (interaction) matrix to different types of obs/exp, pearson or covariance matrix.

usage: hicTransform --matrix MATRIX --outFileName OUTFILENAME
                    [--method {obs_exp,obs_exp_lieberman,obs_exp_non_zero,pearson,covariance}]
                    [--ligation_factor]
                    [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                    [--perChromosome] [--help] [--version]
Required arguments
--matrix, -m

input file. The computation is done per chromosome.

--outFileName, -o

File name to save the exported matrix.

Optional arguments
--method, -me

Possible choices: obs_exp, obs_exp_lieberman, obs_exp_non_zero, pearson, covariance

Transformation methods to use for input matrix. Transformation is computed per chromosome.obs_exp computes the expected matrix as the sum per genomic distance j divided by maximal possible contacts: sum(diagonal(j) / number of elements in diagonal(j) obs_exp_lieberman computes the expected matrix as the sum per genomic distance j divided by the : sum(diagonal(j) / (length of chromosome - j))obs_exp_non_zero computes the expected matrix as the sum per genomic distance j divided by sum of non-zero contacts: sum(diagonal(j) / number of non-zero elements in diagonal(j)Optionally, ``–ligation_factor` can be used for this method as has been used by HOMER software. If –ligation_factor, then exp_i,j = exp_i,j * sum(row(i)) * sum(row(j)) / sum(matrix)pearson computes the Pearson correlation of the input matrix: Pearson_i,j = C_i,j / sqrt(C_i,i * C_j,j) and C is the covariance matrixcovariance computes the Covariance of the input matrix: Cov_i,j = E[M_i, M_j] - my_i * my_j where M is the input matrix and my the mean (Default: “obs_exp”).

Default: “obs_exp”

--ligation_factor

Setting this flag, multiplies a scaling factor to each entry of the expected matrix to take care of the proximity ligation as has been explained in Homer software. This flag is only affective with obs_exp_non_zero method and will be ignored if any other obs/exp method is chosen.

Default: False

--chromosomes

List of chromosomes to be included in the computation.

--perChromosome, -pc

Each chromosome is processed individually, inter-chromosomal interactions are ignored. Option not valid for obs_exp_lieberman.

Default: False

--version

show program’s version number and exit

Background

hicTransform transforms a given input matrix into a new matrix using one of the following methods:

  • obs_exp

  • obs_exp_lieberman

  • obs_exp_non_zero

  • pearson

  • covariance

All expected values are computed per genomic distances.

Usage
$ hicTransform -m matrix.cool --method obs_exp -o obs_exp.cool

For all images data from Rao 2014 was used.

Observed / Expected

All values, including non-zero values, are used to compute the expected values per genomic distance.

\[exp_{i,j} = \frac{ \sum diagonal(|i-j|) }{|diagonal(|i-j|)|}\]
_images/obs_exp.png
Observed / Expected lieberman

The expected matrix is computed in the way as Lieberman-Aiden used it in the 2009 publication. It is quite similar to the obs/exp matrix computation.

\[exp_{i,j} = \frac{ \sum diagonal(|i-j|) } {(length\ of\ chromosome\ - |i-j|))}\]
_images/obs_exp_lieberman.png
Observed / Expected non zero

Only non-zero values are used to compute the expected values per genomic distance, i.e. only non-zero values are taken into account for the denominator.

\[exp_{i,j} = \frac{ \sum diagonal(i-j) }{ number\ of\ non-zero\ elements\ in\ diagonal(|i-j|)}\]
_images/obs_exp_norm.png

By adding the –ligation_factor flag, the expected matrix can be re-scaled in the same way as has been done by Homer software when computing observed/expected matrices with the option ‘-norm’.

\[exp_{i,j} = exp_{i,j} * \sum row(j) * \sum row(i) }{ \sum matrix }\]
_images/obs_exp_norm.png
Pearson correlation matrix
\[Pearson_{i,j} = \frac {C_{i,j} }{ \sqrt{C_{i,i} * C_{j,j} }}\]

C is the covariance matrix

_images/pearson.png _images/obs_exp_pearson.png

The first image shows the Pearson correlation on the original interaction matrix, the second one shows the Person correlation matrix on an observed/expected matrix. A consecutive computation like this is used in the A/B compartment computation.

Covariance matrix
\[Cov_{i,j} = E[M_i, M_j] - \mu_i * \mu_j\]

where M is the input matrix and \(\mu\) the mean.

hicAverageRegions

Sums Hi-C contacts around given reference points and computes their average. This tool is useful to detect differences at certain reference points as for example TAD boundaries between samples.

WARNING: This tool can only be used with fixed bin size Hi-C matrices. No guarantees how and if it works on restriction site interaction matrices.

usage: hicAverageRegions --matrix MATRIX --regions REGIONS
                         (--range RANGE RANGE | --rangeInBins RANGEINBINS RANGEINBINS)
                         --outFileName OUTFILENAME [--help]
                         [--coordinatesToBinMapping {start,center,end}]
                         [--version]
Named Arguments
--range, -ra

Range of region up- and downstream of each region to include in genomic units.

--rangeInBins, -rib

Range of region up- and downstream of each region to include in bin units.

Required arguments
--matrix, -m

The matrix to use for the average of TAD regions.

--regions, -r

BED file which stores a list of regions that are summed and averaged

--outFileName, -o

File name to save the average regions TADs matrix.

Optional arguments
--coordinatesToBinMapping, -cb

Possible choices: start, center, end

If the region contains start and end coordinates, define if the start, center (start + (end-start) / 2) or end bin should be used as start for range.This parameter is only important to set if the given start and end coordinates are not in the same bin (Default: “start”).

Default: “start”

--version

show program’s version number and exit

hicAverageRegions takes as input a BED files with genomic positions, these are the reference points to sum up and average the regions up- and downstream of all these positions, good reference points are e.g. the borders of TADs. This can help to determine changes in the chromatin organization and TAD structure changes.

In the following example the 10kb resolution interaction matrix of Rao 2014 is used.

The first step computes the TADs for chromosome 1.

$ hicFindTADs -m GSE63525_GM12878_insitu_primary_10kb_KR.cool --outPrefix TADs
    --correctForMultipleTesting fdr --minDepth 30000 --maxDepth 100000
    --step 10000 -p 20 --chromosomes 1

Next, we use the domains.bed file of hicFindTADs to use the borders of TADs as reference points. As a range up- and downstream of each reference point 100kb are chosen.

$ hicAverageRegions -m GSE63525_GM12878_insitu_primary_10kb_KR.cool
    -r TADs_domains.bed --range 100000 100000 --outFileName primary_chr1

In a last step, the computed average region is plotted.

$ hicPlotAverageRegions -m primary_chr1.npz -o primary_chr1.png
_images/primary_chr1.png
hicDetectLoops

hicDetectLoops can detect enriched interaction regions (peaks / loops) based on a strict candidate selection, negative binomial distributions and Wilcoxon rank-sum tests.

The algorithm was mainly develop on GM12878 cells from Rao 2014 on 10kb and 5kb fixed bin size resolution.

Example usage
$ hicDetectLoops -m matrix.cool -o loops.bedgraph --maxLoopDistance 2000000 --windowSize 10 --peakWidth 6 --pValuePreselection 0.05 --pValue 0.05

The candidate selection is based on the restriction of the maximum genomic distance, here 2MB. This distance is given by Rao 2014. For each genomic distance a continuous negative binomial distribution is computed and only interaction pairs with a threshold less than --pValuePreselection are accepted. In a second step, each candidate is considered compared to its neighborhood. This neighborhood is defined by the --windowSize parameter in the x and y dimension. Per neighborhood only one candidate is considered, therefore only the candidate with the highest peak values is accepted. As a last step, the neighborhood is split into a peak and background region (parameter --peakWidth). The peakWidth can never be larger than the windowSize. However, we recommend for 10kb matrices a windowSize of 10 and a peakWidth of 6.

With version 3.5 a major revision of this tool was published. The biggest changes are:

  • The introduction of an observed/expected matrix as one of the first steps. Based on it, the other calculations compute without a distance dependent factor and the results are more accurate.

  • The testing of peak vs background region with the donut layout as proposed by HiCCUPS with Wilcoxon rank-sum test. Anderson-Darling test is removed.

  • Improving the handling of the parallelization and rewrote of the merging of candidates in one neighborhood. Results in faster execution time and less memory demand.

  • Loading only the interactions within the range of maxLoopDistance. This is possible with HiCMatrix version 13 and results in faster load time and a reduced memory peak. This improvement is only for cool matrices, h5 matrices do not profit from these changes.

hicDetectLoops has many parameters and it can be quite difficult to search for the best parameter setting for your data. With version 3.5 we introduce therefore two new tools hicHyperoptDetectLoops and hicHyperoptDetectLoopsHiCCUPS. The first one searches for the optimal parameter setting for your data based on HiCExplorer’s hicDetectLoops. However, if you want to compare the results to Juicer HiCCUPS, the second tool provides a parameter search for it. Please note that HiCCUPS and any dependency of it are not provided by HiCExplorer and must be installed by the user on its own.

The output file (´´-o loops.bedgraph``) contains the x and y position of each loop and its corresponding p-value of the statistical test.

1   120000000       122500000       1       145000000       147500000       0.001

The results can visualized using hicPlotMatrix:

$ hicPlotMatrix -m matrix.cool -o plot.png --log1p --region 1:18000000-22000000 --loops loops.bedgraph
_images/hicDetectLoops.png

Computes enriched regions (peaks) or long range contacts on the given contact matrix.

usage: hicDetectLoops --matrix MATRIX --outFileName OUTFILENAME
                      [--peakWidth PEAKWIDTH] [--windowSize WINDOWSIZE]
                      [--pValuePreselection PVALUEPRESELECTION]
                      [--peakInteractionsThreshold PEAKINTERACTIONSTHRESHOLD]
                      [--obsExpThreshold OBSEXPTHRESHOLD] [--pValue PVALUE]
                      [--maxLoopDistance MAXLOOPDISTANCE]
                      [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                      [--threads THREADS]
                      [--threadsPerChromosome THREADSPERCHROMOSOME]
                      [--expected {mean,mean_nonzero,mean_nonzero_ligation}]
                      [--help] [--version]
Required arguments
--matrix, -m

The matrix to compute the loop detection on.

--outFileName, -o

Outfile name to store the detected loops. The file will in bedgraph format.

Optional arguments
--peakWidth, -pw

The width of the peak region in bins. The square around the peak will include (2 * peakWidth)^2 bins (Default: 2).

Default: 2

--windowSize, -w

The window size for the neighborhood region the peak is located in. All values from this region (exclude the values from the peak region) are tested against the peak region for significant difference. The square will have the size of (2 * windowSize)^2 bins (Default: 5).

Default: 5

--pValuePreselection, -pp

Only candidates with p-values less the given threshold will be considered as candidates. For each genomic distance a negative binomial distribution is fitted and for each pixel a p-value given by the cumulative density function is given. This does NOT influence the p-value for the neighborhood testing. Can a single value or a threshold file created by hicCreateThresholdFile (Default: 0.1).

Default: 0.1

--peakInteractionsThreshold, -pit

The minimum number of interactions a detected peaks needs to have to be considered (Default: 10).

Default: 10

--obsExpThreshold, -oet

The minimum number of obs/exp interactions a detected peaks needs to have to be considered (Default: 1.5).

Default: 1.5

--pValue, -p

Rejection level for Anderson-Darling or Wilcoxon-rank sum test for H0. H0 is peak region and background have the same distribution (Default: 0.025).

Default: 0.025

--maxLoopDistance

Maximum genomic distance of a loop, usually loops are within a distance of ~2MB (Default: 2000000).

Default: 2000000

--chromosomes

Chromosomes to include in the analysis. If not set, all chromosomes are included.

--threads, -t

Number of threads to use, the parallelization is implemented per chromosome (Default: 4).

Default: 4

--threadsPerChromosome, -tpc

Number of threads to use per parallel thread processing a chromosome. E.g. –threads = 4 and –threadsPerChromosome = 4 makes 4 * 4 = 16 threads in total (Default: 4).

Default: 4

--expected, -exp

Possible choices: mean, mean_nonzero, mean_nonzero_ligation

Method to compute the expected value per distance: Either the mean (mean), the mean of non-zero values (mean_nonzero) or the mean of non-zero values with ligation factor correction (mean_nonzero_ligation) (Default: “mean”).

Default: “mean”

--version

show program’s version number and exit

hicValidateLocations

hicValidateLoops is a tool to compare the detect loops from hicDetectLoops (or from any other software as long as the data format is followed, see below) with known peak protein locations to validate if the computed loops do have the expected anchor points. For example, loops in mammals are usually bound by CTCF or Cohesin, therefore it is important to know if the detect loops have protein peaks at their X and Y position.

_images/loops_bonev_cavalli.png

Loops in Hi-C, graphic from Bonev & Cavalli, Nature Reviews Genetics 2016

Data format

The data format of hicDetectLoops output is:

chr_x start_x end_x chr_y start_y end_y p-value

As --protein the input of narrowPeak or broadPeak files are accepted. However, as long as the --protein input file contains chromosome, start and end in the first three columns, it should work.

This script overlaps the loop locations with protein locations to determine the accuracy of the loop detection. Loops need to have format as follows:

chr start end chr start end

The protein peaks need to be in narrowPeaks or broadPeak format.

A protein match is successfull if at the bin of the x and y location a protein peak is overlapped. A bin is assumed to have a protein if one or more protein peaks falling within the bin region. The value of the protein is not considered, only match or non-match.

usage: hicValidateLocations --data DATA --validationData VALIDATIONDATA
                            [--validationType {bed,cool}]
                            [--method {loops,tad}] --resolution RESOLUTION
                            [--outFileName OUTFILENAME]
                            [--chrPrefixLoops {None,add,remove}]
                            [--chrPrefixProtein {None,add,remove}] [--help]
                            [--version]
Required arguments
--data, -d

The loop file from hicDetectLoops. To use files from other sources, please follow ‘chr start end chr start end’ format.

--validationData, -vd

The data file to validate the given locations. Can be narrowPeak, broadPeak, cool, or 2D text.

--validationType, -vt

Possible choices: bed, cool

The type of the validation data. Can be bed, or cool format

Default: “bed”

--method, -m

Possible choices: loops, tad

The method used (for the moment only loop is possible) (Default: “loops”).

Default: “loops”

--resolution, -r

The used resolution of the Hi-C interaction matrix.

Optional arguments
--outFileName, -o

The prefix name of the output files. Two file are written: output_matched_locations and output_statistics.First file contains all loop locations with protein location matches, second file contains statistics about this matching.

--chrPrefixLoops, -cl

Possible choices: None, add, remove

Adding / removing / do nothing a ‘chr’-prefix to chromosome name of the loops.

--chrPrefixProtein, -cp

Possible choices: None, add, remove

Adding / removing / do nothing a ‘chr’-prefix to chromosome name of the protein.

--version

show program’s version number and exit

hicMergeLoops

This script merges the locations of loops detected at several resolutions.

Loops in the inputFiles need to have the following format:

chr start end chr start end

Loops are merged if the x and y position of a loop overlap with the x and y position of another loop; all loops are considered as an overlap within +/- the bin size of the lowest resolution. I.e. for a loop with coordinates x and y, the overlap with all other loops is checked for (x - lowest resolution) and (y + lowest resolution). If two or more locations are to be merged, the loop at the lowest resolution is taken as the merged loop.

Example usage:

$ hicMergeLoops -i gm12878_10kb.bedgraph gm12878_5kb.bedgraph gm12878_25kb.bedgraph -o merged_result.bedgraph -r 25000

Please recall: We work with binned data i.e. the lowest resolution is therefore the one where we merge the most bases into one bin. In the above example the lowest resultion is 25kb, the highest resolution is 5kb.

usage: hicMergeLoops --inputFiles INPUTFILES [INPUTFILES ...] --outFileName
                     OUTFILENAME --lowestResolution LOWESTRESOLUTION [--help]
                     [--version]
Required arguments
--inputFiles, -i

The loop files from hicDetectLoops. To use files from other sources, please follow ‘chr start end chr start end’ format and remove any header.

--outFileName, -o

The name of the merged loop file.

--lowestResolution, -r

The lowest resolution of all loop files, i.e. 5kb, 10kb and 25kb, please use 25000.

Optional arguments
--version

show program’s version number and exit

hicHyperoptDetectLoops

usage: hicHyperoptDetectLoops --matrix MATRIX --proteinFile PROTEINFILE
                              --maximumNumberOfLoops MAXIMUMNUMBEROFLOOPS
                              [--outputFileName OUTPUTFILENAME]
                              [--resolution RESOLUTION]
                              [--chrPrefixLoops {None,add,remove}]
                              [--threads THREADS] [--runs RUNS] [--help]
                              [--version]
Required arguments
--matrix, -m

The matrix to compute the loops on.

--proteinFile, -p

The protein file to validate the detected loops

--maximumNumberOfLoops, -ml

The maximum number of loops that should be used for optimization computation.

--outputFileName, -o

File names for the result of the optimization (Default: “hyperopt_result.txt”).

Default: “hyperopt_result.txt”

Optional arguments
--resolution, -re

Resolution of matrix (Default: 10000).

Default: 10000

--chrPrefixLoops, -cl

Possible choices: None, add, remove

Adding / removing / do nothing a ‘chr’-prefix to chromosome name of the loops.

--threads, -t

Number of threads (uses the python multiprocessing module) (Default: 4).

Default: 4

--runs, -r

Number of runs of hyperopt (Default: 100).

Default: 100

--version

show program’s version number and exit

hicHyperoptDetectLoopsHiCCUPS

usage: hicHyperoptDetectLoopsHiCCUPS --matrix MATRIX --proteinFile PROTEINFILE
                                     --maximumNumberOfLoops
                                     MAXIMUMNUMBEROFLOOPS --juicerPath
                                     JUICERPATH
                                     [--outputFileName OUTPUTFILENAME]
                                     [--resolution RESOLUTION]
                                     [--chrPrefixLoops {None,add,remove}]
                                     [--runs RUNS] [--threads THREADS]
                                     --normalization NORMALIZATION [--cpu]
                                     [--restricted] [--help] [--version]
Required arguments
--matrix, -m

The matrix to compute the loops on.

--proteinFile, -p

The protein file to validate the detected loops

--maximumNumberOfLoops, -ml

The maximum number of loops that should be used for optimization computation.

--juicerPath, -j

path to juicer.jar

--outputFileName, -o

File names for the result of the optimization (Default: “hyperoptHiCCUPS_result.txt”).

Default: “hyperoptHiCCUPS_result.txt”

Optional arguments
--resolution, -r

Resolution of matrix (Default: 10000).

Default: 10000

--chrPrefixLoops, -cl

Possible choices: None, add, remove

Adding / removing / do nothing a ‘chr’-prefix to chromosome name of the loops.

--runs

Number of runs of hyperopt.

Default: 100

--threads, -t

Number of threads (uses the python multiprocessing module) (Default: 4).

Default: 4

--normalization, -k

Normalization table name

--cpu

use the CPU version

Default: False

--restricted

If the GPU version is used, search only within 8 MB.

Default: False

--version

show program’s version number and exit

hicCompartmentalization

Rearrange the average interaction frequencies using the first PC values to represent the global compartmentalization signal. To our knowledge this has been first introduced and implemented by Wibke Schwarzer et al. 2017 (Nature. 2017 Nov 2; 551(7678): 51–56)

$ hicCompartmentalization –obsexp_matrices obsExpMatrix.h5 –pca pc1.bedgraph -o global_signal.png

usage: hicCompartmentalization --obsexp_matrices OBSEXP_MATRICES
                               [OBSEXP_MATRICES ...] --pca PCA
                               --outputFileName OUTPUTFILENAME
                               [--quantile QUANTILE] [--outliers OUTLIERS]
                               [--outputMatrix OUTPUTMATRIX]
                               [--offset OFFSET [OFFSET ...]] [-h] [--version]
Required arguments
--obsexp_matrices, -m

HiCExplorer matrices in h5/cool format.

--pca

a PCA vector as a bedgraph file with no header. In case of several matrices with different conditions, ie. controltreatment, the PCA of control can be used. Note that only one PCA can be provided.

--outputFileName, -o

Plot to represent the polarization of A/B compartments.

Optional arguments
--quantile, -q

number of quantiles. (Default: 30).

Default: 30

--outliers

precentage of outlier to remove. (Default: 0).

Default: 0

--outputMatrix

output .npz file includes all the generated matrices

--offset

set nan for the distances mentioned as offset from main diagonal, only positive values are accepted! Example: if –offset 0, then values of main diagonal will set to nan and if –offset 0 1 then on top of the main diagonal, +1 and -1 diagonal values are also set to nan.

--version

show program’s version number and exit

PCA to compute the global compartmentalization signal

To our knowledge this method has been first introduced by Wibke Schwarzer et al. 2017 (Nature. 2017 Nov 2; 551(7678): 51–56). In this method, a global (genome-wide) strength for compartmentalization is computed as (AA + BB) / (AB + BA) after rearranging the bins of obs/exp based on their corresponding pc1 values. For this purpose, first pc1 values are reordered incrementally, then the same order of bins is used to rearrange the bins in obs/exp matrix.

$ _hicCompartmentalization --obsexp_matrices obsExpMatrix.h5 --pca pc1.bedgraph
  -o global_signal.png
hicPlotSVL
Description

Plots the relation between short and long range interactions as boxplots and if more than one matrix is given, p-values of the distributions are computed. An example usage is: $ hicPlotSVL -m hmec_10kb.cool nhek_10kb.cool

The datapoints per sample are the ratios per chromosome.

usage: hicPlotSVL --matrices MATRICES [MATRICES ...]
                  [--plotFileName PLOTFILENAME] [--outFileName OUTFILENAME]
                  [--outFileNameData OUTFILENAMEDATA] [--distance DISTANCE]
                  [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                  [--threads THREADS] [--dpi DPI]
                  [--colorList COLORLIST [COLORLIST ...]] [--help] [--version]
Required arguments
--matrices, -m

The matrix (or multiple matrices) to use for the comparison

Optional arguments
--plotFileName, -pfn

Plot name (Default: “plot.png”).

Default: “plot.png”

--outFileName, -o

File the p-values are written to, p-values are only computed if at least two matrices are given (Default: “p_values.txt”).

Default: “p_values.txt”

--outFileNameData, -od

File the computed ratios are written to (Default: “data.txt”).

Default: “data.txt”

--distance, -d

Distance (in bp) which should be considered as short range. Default 2MB (2000000).

Default: 2000000

--chromosomes

Chromosomes to include in the analysis. If not set, all chromosomes are included.

--threads, -t

Number of threads. Using the python multiprocessing module (Default: 4).

Default: 4

--dpi

Optional parameter: Resolution for the image in case theoutput is a raster graphics image (e.g png, jpg) (Default: 300).

Default: 300

--colorList, -cl

Colorlist for the boxplots (Default: g b c m y k).

Default: [‘g’, ‘b’, ‘c’, ‘m’, ‘y’, ‘k’]

--version

show program’s version number and exit

hicPlotSVL computes the ratio between short range and long range contacts per chromosome independently. Per sample one box plot is created and, if more than one sample is given, the computed ratios are assumed to be one distribution and a Wilcoxon rank-sum test under H0 ‘distributions are equal’ is computed. All used data is written to a third raw data file.

The distance to distinct short and long range contacts can be set via the –distance parameter; for short range the sum for all contacts smaller or equal this distance are computed, for long range contacts all contacts greater this distance.

Usage example
$ hicPlotSVL -m hmec_10kb.cool nhek_10kb.cool hmec_10kb.cool --distance 2000000 --threads 4 --plotFileName plot.png --outFileName pvalues.txt --outFileNameData rawData.txt

This results in three files:

The raw data containing the sums for short range, long range and ratio per sample and chromosome.

# Created with HiCExplorer's hicPlotSVL 3.3
# Short range vs long range contacts per chromosome: raw data
# Short range contacts: <= 2000000
#   hmec_10kb.cool                  nhek_10kb.cool                  hmec_10kb.cool
# Chromosome        Ratio   Sum <= 2000000  Sum > 2000000   Ratio   Sum <= 2000000  Sum > 2000000   Ratio   Sum <= 2000000  Sum > 2000000
1   3.0399769346724543      33476834        11012200        2.79740105237572        44262902        15822866        3.0399769346724543      33476834        11012200
2   2.7532203542810625      31723954        11522490        2.5007877714355438      47468438        18981394        2.7532203542810625      31723954        11522490
3   2.922650759458664       26251027        8981924 2.6235211241878442      39640848        15109788        2.922650759458664       26251027        8981924
4   2.7235598858451637      22474680        8251950 2.5572455199457864      37486882        14659086        2.7235598858451637      22474680        8251950
5   2.9585962905193712      22716268        7678056 2.752922527526723       35445722        12875670        2.9585962905193712      22716268        7678056
6   3.168274165465025       22872690        7219290 2.8602111006131703      33990211        11883812        3.168274165465025       22872690        7219290
7   3.1093346580597516      19603416        6304698 2.8021236966788887      29712823        10603680        3.1093346580597516      19603416        6304698
8   3.135391026076832       18355087        5854162 2.7964394470859024      28660624        10248970        3.135391026076832       18355087        5854162
9   4.1147978383348125      15395763        3741560 3.819940066283481       21994046        5757694 4.1147978383348125      15395763        3741560
10  3.448063050802953       17964043        5209894 3.1116673856502253      26270171        8442474 3.448063050802953       17964043        5209894
11  3.5924666993070407      18651850        5191934 3.1364875011923035      26240350        8366158 3.5924666993070407      18651850        5191934
12  3.6817551043464416      18640866        5063038 3.306662109403207       26101554        7893626 3.6817551043464416      18640866        5063038
13  3.476204237522881       11018462        3169682 3.0976674036654805      18922281        6108558 3.476204237522881       11018462        3169682
14  3.70550850832778        11164875        3013048 3.6226817463785164      17245704        4760480 3.70550850832778        11164875        3013048
15  4.607631079612186       11165313        2423222 4.567998349104569       15273742        3343640 4.607631079612186       11165313        2423222
16  4.397874357146307       10745775        2443402 3.890983210350018       14666462        3769346 4.397874357146307       10745775        2443402
17  5.809374740402161       12168235        2094586 5.3360710927739285      14154110        2652534 5.809374740402161       12168235        2094586
18  3.7647349280938895      9339833 2480874 3.485487446356812       15019063        4309028 3.7647349280938895      9339833 2480874
19  6.492239632778196       8466283 1304062 5.774337450385819       9368978 1622520 6.492239632778196       8466283 1304062
20  5.542933774973686       8962935 1617002 4.977679877778358       12009479        2412666 5.542933774973686       8962935 1617002
21  6.665622315255486       3910374 586648  6.1843701763589225      6554715 1059884 6.665622315255486       3910374 586648
22  8.063663557923096       4992327 619114  7.433759425439728       5932928 798106  8.063663557923096       4992327 619114
X   2.208752982178897       14424173        6530460 2.3130534357407995      27628734        11944702        2.208752982178897       14424173        6530460
Y   4.165021803993573       36294   8714    3.8063291139240505      45105   11850   4.165021803993573       36294   8714
MT

The p-values between the samples:

# Created with HiCExplorer's hicPlotSVL 3.3
# Short range vs long range contacts per chromosome, p-values of each distribution against each other distribution with Wilcoxon rank-sum
# Short range contacts: <= 2000000
hmec_10kb.cool      nhek_10kb.cool  0.28362036331636575
hmec_10kb.cool      hmec_10kb.cool  1.0
nhek_10kb.cool      hmec_10kb.cool  0.28362036331636575

The box plot:

_images/plot_svl.png

Tools for TADs processing

hicFindTADs
Description

Uses a measure called TAD-separation score to identify the degree of separation between the left and right regions at each Hi-C matrix bin. This is done for a running window of different sizes. Then, TADs are called as those positions having a local TAD-separation score minimum. The TAD-separation score is measured using the z-score of the Hi-C matrix and is defined as the mean zscore of all the matrix contacts between the left and right regions (diamond). To find the TADs, the program needs to compute first the TAD scores at different window sizes. Then, the results of that computation are used to call the TADs. This is convenient to test different filtering criteria quickly as the demanding step is the computation of TAD-separation scores.

A simple example usage is:

$ hicFindTads -m hic_matrix.h5 –outPrefix TADs –correctForMultipleTesting fdr The bedgraph file produced by this tool can be used to plot the so-called insulation score along the genome or at specific regions. This score is much more reliable across samples than the number of TADs or the TADs width that can vary depending on the sequencing depth because of the lack of information at certain bins, and depending on the parameters used with this tool.

usage: hicFindTADs --matrix MATRIX --outPrefix OUTPREFIX
                   --correctForMultipleTesting {fdr,bonferroni,None}
                   [--minDepth INT bp] [--maxDepth INT bp] [--step INT bp]
                   [--TAD_sep_score_prefix TAD_SEP_SCORE_PREFIX]
                   [--thresholdComparisons THRESHOLDCOMPARISONS]
                   [--delta DELTA] [--minBoundaryDistance MINBOUNDARYDISTANCE]
                   [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                   [--numberOfProcessors NUMBEROFPROCESSORS] [--help]
                   [--version]
Required arguments
--matrix, -m

Corrected Hi-C matrix to use for the computations.

--outPrefix

File prefix to save the resulting files: 1. <prefix>_tad_separation.bm The format of the output file is chrom start end TAD-sep1 TAD-sep2 TAD-sep3 .. etc. We call this format a bedgraph matrix and can be plotted using hicPlotTADs. Each of the TAD-separation scores in the file corresponds to a different window length starting from –minDepth to –maxDepth. 2. <prefix>_zscore_matrix.h5, the z-score matrix used for the computation of the TAD-separation score. 3. < prefix > _boundaries.bed, which contains the positions of boundaries. The genomic coordinates in this file correspond to the resolution used. Thus, for Hi-C bins of 10.000bp the boundary position is 10.000bp long. For restriction fragment matrices the boundary position varies depending on the fragment length at the boundary. 4. <prefix>_domains.bed contains the TADs positions. This is a non-overlapping set of genomic positions. 5. <prefix>_boundaries.gff Similar to the boundaries bed file but with extra information (p-value, delta). 6. <prefix>_score.bedgraph file contains the TAD-separation score measured at each Hi-C bin coordinate. Is useful to visualize in a genome browser. The delta and p-value settings are saved as part of the name.

--correctForMultipleTesting

Possible choices: fdr, bonferroni, None

Select the bonferroni or false discovery rate for a multiple comparison. Bonferroni controls the family-wise error rate (FWER) and needs a p-value. The false discovery rate (FDR) controls the likelyhood of type I errors and needs a q-value. As a third option it is possible to not use a multiple comparison method at all (Default: “fdr”).

Default: “fdr”

Optional arguments
--minDepth

Minimum window length (in bp) to be considered to the left and to the right of each Hi-C bin. This number should be at least 3 times as large as the bin size of the Hi-C matrix.

--maxDepth

Maximum window length to be considered to the left and to the right of the cut point in bp. This number should around 6-10 times as large as the bin size of the Hi-C matrix.

--step

Step size when moving from –minDepth to –maxDepth. Note, the step size grows exponentially as minDeph + (step * int(x)**1.5) for x in [0, 1, …] until it reaches maxDepth. For example, selecting step=10,000, minDepth=20,000 and maxDepth=150,000 will compute TAD-scores for window sizes: 20,000, 30,000, 40,000, 70,000 and 100,000

--TAD_sep_score_prefix

Sometimes it is useful to change some of the parameters without recomputing the z-score matrix and the TAD-separation score. For this case, the prefix containing the TAD separation score and the z-score matrix can be given. If this option is given, new boundaries will be computed but the values of –minDepth, –maxDepth and –step will not be used.

--thresholdComparisons

P-value threshold for the Bonferroni correction / q-value for FDR. The probability of a local minima to be a boundary is estimated by comparing the distribution (Wilcoxon ranksum) of the z-scores between the left and right regions (diamond) at the local minimum with the matrix z-scores for a diamond at –minDepth to the left and a diamond –minDepth to the right. If –correctForMultipleTesting is ‘None’ the threshold is applied on the raw p-values without any multiple testing correction. Set it to ‘1’ if no threshold should be used (Default: 0.01).

Default: 0.01

--delta

Minimum threshold of the difference between the TAD-separation score of a putative boundary and the mean of the TAD-sep. score of surrounding bins. The delta value reduces spurious boundaries that are shallow, which usually occur at the center of large TADs when the TAD-sep. score is flat. Higher delta threshold values produce more conservative boundary estimations (Default: 0.01).

Default: 0.01

--minBoundaryDistance

Minimum distance between boundaries (in bp). This parameter can be used to reduce spurious boundaries caused by noise.

--chromosomes

Chromosomes and order in which the chromosomes should be plotted. This option overrides –region and –region2.

--numberOfProcessors, -p

Number of processors to use (Default: 1).

Default: 1

--version

show program’s version number and exit

Usage example

It is recommended to test multiple parameters of TAD calling with hicFindTADs before making conclusions about the number of TADs in a given sample or before comparing TAD calling between multiple conditions. In order to compare several TAD calling parameters at once, it is recommended to use hicPlotTADs.

Below you can find a typical command-line to use hicFindTADs:

$ hicFindTADs -m myHiCmatrix.h5 \
--outPrefix myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr \
--minDepth 3000 \
--maxDepth 31500 \
--step 1500 \
--thresholdComparisons 0.05 \
--delta 0.01 \
--correctForMultipleTesting fdr \
-p 64

This command will output the following files:

myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_boundaries.bed
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_boundaries.gff
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_domains.bed
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_score.bedgraph
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_score.npz
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_tad_score.bm
myHiCmatrix_min3000_max31500_step1500_thres0.05_delta0.01_fdr_zscore_matrix.h5

TAD boundaries locations are stored in the boundaries files, domains.bed file contains the TAD locations, score files contain TAD separation score, or the so-called TAD insulation score, in various formats. As a side note, the tad_score.bm file is a bedgraph matrix that can be used to display TAD separation score curves in hicPlotTADs for example.

The zscore_matrix.h5 file contains a z-score matrix that is useful to quickly test the –thresholdComparisons, –delta and –correctForMultipleTesting parameters by using the –TAD_sep_score_prefix option pointing to this zscore_matrix.h5 file. For example to quickly test a –thresholdComparisons of 0.01 instead of 0.05 we can run the following command:

$ hicFindTADs -m myHiCmatrix.h5 \
--outPrefix myHiCmatrix_min10000_max40000_step1500_thres0.01_delta0.01_fdr \
--TAD_sep_score_prefix myHiCmatrix_min10000_max40000_step1500_thres0.05_delta0.01_fdr
--thresholdComparisons 0.01 \
--delta 0.01 \
--correctForMultipleTesting fdr \
-p 64

As you can see above, –minDepth, –maxDepth and –step are ignored because these parameters are used to calculate the z-score matrix which is here provided to –TAD_sep_score_prefix. Since z-score matrix computation is the most demanding step of hicFindTADs in terms of memory and computation, the above command will thus run significantly faster than the previous one.

Multiple combinations of parameters can be tested that way with only one z-score matrix computation. To compare several TAD calling outputs, we use hicPlotTADs with the following command using, for example, the following tracks.ini file:

  • command line:

$ hicPlotTADs --tracks tracks.ini --region chrX:6800000-8500000  -o TAD_calling_comparison.png
  • tracks.ini:

[x-axis]
fontsize=10

[hic]
file = myHiCmatrix.h5
title = Threshold 0.05
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.05_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[hic]
file = myHiCmatrix.h5
title = Threshold 0.01
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.01_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[spacer]
height = 0.1

[hic]
file = myHiCmatrix.h5
title = Threshold 0.005
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.005_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[spacer]
height = 0.1

[hic]
file = myHiCmatrix.h5
title = Threshold 0.001
colormap = Spectral_r
depth = 400000
min_value = 1
max_value = 80
transform = log1p
file_type = hic_matrix
show_masked_bins = false

[tads]
file = myHiCmatrix_min10000_max40000_step1500_thres0.001_delta0.01_fdr_domains.bed
file_type = domains
border_color = black
color = none
overlay_previous = share-y

[spacer]
height = 0.1

[bigwig]
file = /data/processing/ChIP-Seq_Embryos/H3K36me3_14c.bigwig
title = H3K36me3
color = darkred
min_value = 0
max_value = auto
height = 2
file_type = bigwig

[spacer]
height = 0.1

[genes]
file = /data/group/bedfiles/dm6/genes_sorted.bed
title = genes
color = black
height = 18
labels = true
file_type = bed

This will result in the following plot where we see that the fourth set of hicFindTADs parameters with a threshold of 0.001 gives the best results in terms of TAD calling compared to the corrected Hi-C counts distribution and compared to the enrichment of H3K36me3, which is known to be enriched at TAD boundaries in Drosophila melanogaster.

_images/hicFindTADs_TAD_calling_comparison.png
Notes

In the _domains.bed output file, the 5th column contains the TAD-separation score at the boundary located at the start of each domain.

The process to identify boundaries is as follows:

  • call all local minima in the average TAD-score. Each local minimum should be separated by at least min_boundary_distance. If this value is not given, it is set to the average bin size * 4

  • for each local minimum detected, compute its p-value and then compute a q-value.

  • for each local minimum detected, compute the ‘delta’ which is the difference between the mean TAD-score of the 10 bins before and the 10 bins after the minimum (excluding the min point)

  • keep only those minima that fulfill the following criteria: the p-value (or q-value depending on the user selection) should be below the given threshold and the delta should be above the user defined threshold

  • everything between 2 consecutive boundaries is a TAD

For computation of the p-values, the distribution of the z-scores at the ‘diamond’ above the local minimum is compared with the distribution of z-scores that are min_depth downstream using the Wilcoxon rank-sum test. Similarly, the distribution of z-scores is computed with the z-scores min_depth upstream of the local minimum. The smallest of the two p-values is assigned to the local minimum.

If min_depth is not given, this is computed as bin size * 30 (if the bins are smaller than 1000), as bin size * 10 if the bins are between 1000 and 20.000 and as bin size * 5 if the bin size is bigger than 20.000.

If max_depth is not given, this is computed as bin size * 60 (if the bins are smaller than 1000), as bin size * 40 if the bins are between 1000 and 20.000 and as bin size * 10 if the bin size is bigger than 20.000.

hicMergeDomains

hicMergeDomains takes as input multiple TAD domain files from hicFindTads. It merges TADs from different resolutions to one TAD domains file, considers protein peaks from known TAD binding sites and computes a dependency graph of the TADs.

Two TADs are considered as one if they don’t overlap at x bins given by –value; TAD borders need to match the protein peaks given by –proteinFile; a relation between two TADs is given by their overlap of area in percent, parameter –percent. The protein peaks are only considered if in one bin at least –minPeak.

An example usage is:

$ hicMergeDomains –domainFiles 10kbtad_domains.bed 50kbtad_domains.bed –proteinFile ctcf_sorted.bed –outputMergedList two_files_ctcf –outputRelationList two_files_relation_ctcf –outputTreePlotPrefix two_files_plot_ctcf –outputTreePlotFormat pdf

usage: hicMergeDomains --domainFiles DOMAINFILES [DOMAINFILES ...]
                       [--proteinFile PROTEINFILE]
                       [--minimumNumberOfPeaks MINIMUMNUMBEROFPEAKS]
                       [--value VALUE] [--percent PERCENT]
                       [--outputMergedList OUTPUTMERGEDLIST]
                       [--outputRelationList OUTPUTRELATIONLIST]
                       [--outputTreePlotPrefix OUTPUTTREEPLOTPREFIX]
                       [--outputTreePlotFormat OUTPUTTREEPLOTFORMAT] [--help]
                       [--version]
Required arguments
--domainFiles, -d

The domain files of the different resolutions is required

Optional arguments
--proteinFile, -p

In order to be able to better assess the relationship between TADs, the associated protein file (e.g. CTCF for mammals) can be included. The protein file is required in broadpeak format

--minimumNumberOfPeaks, -m

Optional parameter to adjust the number of protein peaks when adapting the resolution to the domain files. At least minimumNumberOfPeaks of unique peaks must be in a bin to considered. Otherwise the bin is treated like it has no peaks (Default: 1).

Default: 1

--value, -v

Determine a value by how much the boundaries of two TADs must at least differ to consider them as two separate TADs (Default: 5000).

Default: 5000

--percent, -pe

For the relationship determination, a percentage is required from which area coverage the TADs are related to each other.For example, a relationship should be entered from 5 percent area coverage -p 0.05 (Default: 0.5).

Default: 0.5

--outputMergedList, -om

File name for the merged domains list (Default: “mergedDomains.bed”).

Default: “mergedDomains.bed”

--outputRelationList, -or

File name for the relationship list of the TADs (Default: “relationList.txt”).

Default: “relationList.txt”

--outputTreePlotPrefix, -ot

File name prefix for the relationship tree of the TADs (Default: “relationship_tree_”).

Default: “relationship_tree_

--outputTreePlotFormat, -of

File format of the relationship tree. Supported formats are listed on: https://www.graphviz.org/doc/info/output.html (Default: “pdf”).

Default: “pdf”

--version

show program’s version number and exit

This tool is the result of the bachelor thesis (in German) from Sarah Domogalla at the Bioinformatics lab of the Albert-Ludwigs-University Freiburg. The thesis was written in winter semester 2019/2020.

The thesis covers the issue of differently detected TADs of the same data but on different resolutions.

_images/tads_different_resolution.png

As it can be seen here, the detected TADs on the four resolutions are quite different, and have maybe a hierarchy. Compare for example the region around 53 Mb in the different resolutions, it is unclear if these are multiple small TADs as suggest by the 10kb resolution, mid-size one as seen by 50 kb or one big, as detected on 100 kb resolution. Consider the next image for an abstract representation of this problem.

_images/tad_hier.png

The abstract representation shows multiple issues: Are ID_85,ID_86 and ID_87 related in any hierarchy? Which role plays ID_75? hicMergeDomains tries to solve this by providing the parameters –value and –percentage to control how the relationship is determined and if TADs are duplicated and therefore deleted.

_images/tads_hier_abstract.png

This abstract representation of the hierarchical TADs is created for each chromosome and provides insights for the organization of the chromatin structure in the three dimensional space.

_images/tads_all_clean.png

The last image shows a cleaned TAD detection based on all four inputs.

hicDifferentialTAD

Computes differential TADs by comparing the precomputed TAD regions of the target matrix with the same regions of the control matrix. Please notice that the matrices need to have the same read coverage, this can be achieved with hicNormalize and the ‘smallest’-mode. H0 is the assumption that two regions are identical, the rejected files contain therefore the as differential considered regions.

usage: hicDifferentialTAD [--targetMatrix TARGETMATRIX]
                          [--controlMatrix CONTROLMATRIX]
                          [--tadDomains TADDOMAINS]
                          [--outFileNamePrefix OUTFILENAMEPREFIX]
                          [--pValue PVALUE]
                          [--mode {intra-TAD,left-inter-TAD,right-inter-TAD,all}]
                          [--modeReject {all,one}] [--threads THREADS]
                          [--help] [--version]
Required arguments
--targetMatrix, -tm

The matrix which was used to compute the TADs

--controlMatrix, -cm

The control matrix to test the TADs for a differential interaction pattern.

--tadDomains, -td

The TADs domain file computed by hicFindTADs.

--outFileNamePrefix, -o

Outfile name prefix to store the accepted / rejected H0 TADs.

Default: “output_differential_tad”

Optional arguments
--pValue, -p

H0 is considered as ‘two regions are identical.’ i.e. all regions with a test result of <= p-value are rejected and considered as differential (Default: 0.05).

Default: 0.05

--mode, -m

Possible choices: intra-TAD, left-inter-TAD, right-inter-TAD, all

Consider only intra-TAD interactions, or additional left inter-TAD, right inter-TAD or all (Default: “all”).

Default: “all”

--modeReject, -mr

Possible choices: all, one

All test of a mode must be rejected (all) or reject region (and accept it is differential) as soon as at least one region is having a p-value <= –pValue (Default: “one”).

Default: “one”

--threads, -t

Number of threads to use, the parallelization is implemented per chromosome (Default: 4).

Default: 4

--version

show program’s version number and exit

hicDifferentialTAD computes with a treatment Hi-C file, a control Hi-C file and a precomputed TAD domains file if the detected TADs are differential between the treatment and the control sample. The TADs need to be precomputed on the treatment file with _hicFindTADs.

hicDifferentialTAD extract per TAD three regions: the intra-TAD, the left and right inter-TAD region. In the following image, the upper visualization shows a region with the detected TADs which are indicated by the black lines. The bottom shows as an example which regions are used for the differential test: the intra-TAD region is highlighted in red, the left inter-TAD in sandy-color and the right inter-TAD in blue. Between two samples a Wilcoxon rank-sum test is applied for a TAD under H0: ‘The regions are equal’. For all three regions of a TAD the rank-sum test is independently applied. The user has the choice with the two parameters ‘mode’ and ‘modeReject’ to define if a) all three regions should be considered (‘all’), or only a subset e.g. the intra-TAD or intra-TAD and left inter-TAD should be considered; and b) if all regions need to have lower p-value than the user given to reject H0 or if it is enough that at least one region is rejecting H0 to consider the region as differential.

_images/hicDifferentialTAD_example1.png
Example usage
$ hicDifferentialTAD -tm GSM2644945_Untreated-R1.100000_chr1.cool -cm GSM2644947_Auxin2days-R1.100000_chr1.cool -td untreated_R1_domains.bed -o differential -p 0.01 -t 4 -mr all

In this example data from Nora et al. “Targeted Degradation of CTCF Decouples Local Insulation of Chromosome Domains from Genomic Compartmentalization”, Cell 169.5 (2017): 930-944 is used [GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98671]

_images/hicDifferentialTAD.png

The output are two BED-similar files: the ‘_accepted.diff_tad’ and ‘_rejected.diff_tad’ file. The difference to a real BED file is a) the usage of a header starting with ‘#’, the first six columns are BED6 standard, however, there are three additional columns with the p-values of each intra-TAD, left-inter-TAD and right-inter-TAD test. The score value and name is copied from the domains.bed file which is a output of hicFindTADs.

# Created with HiCExplorer's hicDifferentialTAD version 3.5-dev
# H0 'regions are equal' H0 is rejected for all p-value smaller or equal the user given p-value threshold; i.e. regions in this file are considered as differential.
# Rejected regions with Wilcoxon rank-sum test to p-value: 0.01  with used mode: all and modeReject: all
# Chromosome        start   end     name    score   strand  p-value left-inter-TAD  p-value right-inter-TAD p-value intra-TAD
chr1        4400000 6200000 ID_0.01_1       -0.5630275      .       nan     0.42772134044376386     0.0001942482956610518
chr1        6200000 7300000 ID_0.01_2       -0.235798       .       0.3479058102348157      0.651801360704674       0.011174626122333891
chr1        7300000 9500000 ID_0.01_3       -0.44334        .       0.7619708385959597      0.9408966423531526      4.668547072331386e-06
chr1        10100000        11600000        ID_0.01_5       -0.6844015      .       0.10355136240139871     0.2022260523997077      0.006197912102328514
chr1        11600000        12900000        ID_0.01_6       -0.607587       .       0.39564035674322084     0.04934747494654432     5.069275897797787e-07
chr1        12900000        13700000        ID_0.01_7       -0.6303795      .       0.030328896271663634    0.048097680638217614    0.0023999817017426378
chr1        13700000        14900000        ID_0.01_8       -1.2063985      .       0.07605786424686575     0.5322636852613494      1.8146100883843892e-06
chr1        14900000        17100000        ID_0.01_9       -0.3406565      .       0.4982863585572014      0.692745144471043       0.002562121461829293
chr1        17100000        18100000        ID_0.01_10      -0.230354       .       0.8237750360973404      0.011911117420279341    0.00020602838341698515
chr1        18100000        19100000        ID_0.01_11      -0.5135365      .       0.007089417650089524    0.07553560212368073     0.03561337274424189
hicMergeTADbins

Uses a BED file of domains or TAD boundaries to merge the bin counts of a Hi-C matrix per TAD.

The output matrix contains the total counts per TAD and the total contacts with all other TADs.

usage: hicMergeTADbins [-h] --matrix MATRIX --domains DOMAINS --outFile
                       OUTFILE [--version]
Named Arguments
--matrix, -m

Path to Hi-C matrix to use.

--domains

Path to a bed file containing the domains.

--outFile, -o

Name for the resulting matrix file.

--version

show program’s version number and exit

hicInterIntraTAD

Extracts and computes different inter and intra TAD values and ratios.

usage: hicInterIntraTAD [--matrix MATRIX] [--tadDomains TADDOMAINS]
                        [--outFileName OUTFILENAME]
                        [--outFileNameRatioPlot OUTFILENAMERATIOPLOT]
                        [--fontsize FONTSIZE] [--dpi DPI] [--threads THREADS]
                        [--help] [--version]
Required arguments
--matrix, -m

The matrix which was used to compute the TADs

--tadDomains, -td

The TADs domain file computed by hicFindTADs.

--outFileName, -o

Outfile name

Default: “output_interintra_tad.tzt”

Optional arguments
--outFileNameRatioPlot, -op

Outfile name for the inter-left/intra vs inter-right/intra ratio plot

Default: “ratio.png”

--fontsize

Fontsize in the plot for x and y axis.

Default: 15

--dpi

The dpi of the scatter plot.

Default: 300

--threads, -t

Number of threads to use, the parallelization is implemented per chromosome (Default: 4).

Default: 4

--version

show program’s version number and exit

This tool computes and extracts for given TADs the inter-left, inter-right and intra-TAD data: the absolute number of contacts, the density (non-zero contacts / all possible contacts) and the ratio of inter-left/intra, inter-right/intra and (inter-left + inter-right)/intra TAD contacts. The data is saved in one file where it can be used for further, user-specific computations outside of HiCExplorer. Also a scatter plot of the ratio inter-left/intra vs. inter-right/intra is created.

# Created with HiCExplorer's hicInterIntraTAD version 3.7-dev
# Chromosome        start   end     name    score   strand  inter_left_sum  inter_right_sum inter_left_density      inter_right_density     inter_left_number_of_contacts   inter_right_number_of_contacts  inter_left_number_of_contacts_nnz       inter_right_number_of_contacts_nnz      intra_sum       intra_number_of_contacts        intra_number_of_contacts_nnz    intra_density   inter_left_intra_ratio  inter_right_intra_ratio inter_left_inter_right_intra_ratio
chr1        4400000 6200000 ID_0.01_1       -0.5630275      .       0       265.6981737879304       0       1.0     0       180     0       180     780.0186987409819       324     324     1.0     0.0     0.340630518494993       0.340630518494993
chr1        6200000 7300000 ID_0.01_2       -0.235798       .       288.00513572726237      327.503479611623        1.0     1.0     198     231     198     231     339.91508704783513      121     121     1.0     0.8472855330682405      0.9634861531331044      1.8107716862013452
chr1        7300000 9500000 ID_0.01_3       -0.44334        .       340.12385944568155      159.40880484157745      1.0     1.0     242     110     242     110     1078.1133262629996      484     484     1.0     0.31548061892958207     0.14785904316211984     0.46333966209170185
chr1        9500000 10100000        ID_0.01_4       -1.021538       .       186.65816676710497      124.2235190772146       1.0     1.0     132     84      132     84      118.58286349492002      36      36      1.0     1.5740737005823884      1.0475672067282826      2.6216409073106712
_images/ratio.png
hicTrainTADClassifier
Description

Check out hicTADClassifier for calling TADs using our default classifiers or your trained classifier.

This program can be used to train and test new and existing classifiers for hicTADClassifier. These classifiers can later be run to call boundaries for TADs. There are four modes available: train_new, train_existing, train_test and predict_test. By default, an EasyEnsembleClassifier as described in Liu et al.: “Exploratory Undersampling for Class-Imbalance Learning” will be trained, but you can pass any sklearn classifier that allows for a warm start. You may also vary the resampling method and a range of hyperparameters to fine tune the model. Do mind to set the correct normalization method and resolution for the classifier. The program will check and raise warnings, when resolutions and normalization methods are mixed up. Also, a protein track file in the narrowPeak format with a threshold value may be passed to filter out low quality boundaries.

train_test mode: this is a convenience function, where a single matrix/domains set can be passed to quickly assert the performance of a new classifier. Nothing will be saved from this mode, instead, the classifier will be trained on 80% of the data and tested on the remaining 20%. The output will be a performance report. A quick usage example can be seen here:

$ hicTrainTADClassifier -m ‘train_test’ -f ‘my_test_matrix.cool’ -d ‘domains.bed’ -o ‘report.txt’ -n ‘range’ -r 10000

train_new mode: this mode allows the training of a new classifier. Note that range of optional arguments, that can be used to fine tune. The resulting classifier will be pickled at the specified out_file. A quick example can be seen here, where we varied the feature distance:

$ hicTrainTADClassifier -m ‘train_new’ -f ‘my_test_matrix.cool’ -d ‘domains.bed’ -o ‘new_classifier.data’ -n ‘range’ -r 10000 –distance 18

train_existing mode: train the classifier specified in saved_classifier on new data. When not setting the saved_classifier, the preset classifiers will be used as preset. The output will be the classifier trained with additional data and more internal estimators.

$ hicTrainTADClassifier -m ‘train_existing’ -f ‘my_test_matrix.cool’ -d ‘domains.bed’ -o ‘updated_classifier.data’ -n ‘range’ -r 10000

predict_test mode: predict using an existing classifier and produce a classification report. The difference in using this over hicTADClassifier is, that this version will predict on a balanced test set. Normally, HiC-Matrices contain a lot more non-boundaries than boundaries, which skews the classification report to the point, where it does not contain usefull information anymore. By passing a domain file produced by another TAD Caller, hicTrainClassifier will build a test set using the boundaries of this domain file and will pick at random as many non-boundaries from the passed matrix. Use this over hicTADClassifier to produce a meaningful output, but not for TAD calling.

$ hicTrainTADClassifier -m ‘predict_test’ -f ‘my_test_matrix.cool’ -d ‘domains.bed’ -o ‘report.txt’ -n ‘range’ -r 10000

usage: hicTrainTADClassifier --mode
                             {train_new,train_existing,train_test,predict_test}
                             --matrices MATRICES [MATRICES ...] --domain_file
                             DOMAIN_FILE [DOMAIN_FILE ...] --out_file OUT_FILE
                             [--normalization_method {obs_exp,range}]
                             [--resolution RESOLUTION] [--threshold THRESHOLD]
                             [--leniency LENIENCY]
                             [--saved_classifier SAVED_CLASSIFIER]
                             [--unselect_border_cases]
                             [--protein_file PROTEIN_FILE [PROTEIN_FILE ...]]
                             [--threads THREADS]
                             [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                             [--concatenate_before_resample]
                             [--estimators_per_step [5-1000]]
                             [--resampling_method {undersample_cluster_centroids,undersample_random,passed_method}]
                             [--alternative_resampling_method ALTERNATIVE_RESAMPLING_METHOD]
                             [--distance [5-30]] [--impute_value IMPUTE_VALUE]
                             [--alternative_classifier ALTERNATIVE_CLASSIFIER]
                             [--use_cleanlab] [--help] [--version]
Required arguments
--mode, -mo

Possible choices: train_new, train_existing, train_test, predict_test

choice of program

--matrices, -m

HiC-Matrix file or list of files for input. Only COOLER files are supported!

--domain_file, -d

domain file or list of files containing tad boundaries

--out_file, -o

output file for either the classification report or the saved classifier

--normalization_method, -n

Possible choices: obs_exp, range

set the normalization mode, with which the passed matrices will be normalized

--resolution, -r

resolution in bases of the classifier

Optional arguments
--threshold

threshold for protein quality check

--leniency

leniency for protein quality check. Widens peaks of protein file by leniency*resolution

Default: 0

--saved_classifier

pickled classifier to be trained or used for prediction

--unselect_border_cases

set whether genes at the border of the matrix up to set distance will not be used for training and testing

Default: True

--protein_file

provide a bed file for TAD quality control

--threads, -t

number of threads used

Default: 4

--chromosomes

Chromosomes to include in the analysis. If not set, all chromosomes are included.

--concatenate_before_resample

whether features build from matrix list are concatenated and resampled together or resampled separatly per matrix. Not important for random undersampling, but alter for other resampling methods and check if performance increases.

Default: False

--estimators_per_step

how many estimators are added in each training step for the classifier (new classifier)

Default: 20

--resampling_method

Possible choices: undersample_cluster_centroids, undersample_random, passed_method

the method used to resample the training set(new classifier)

Default: “undersample_random”

--alternative_resampling_method

pass alternative resampling method from imblearn library (new classifier)

--distance

max distance between TADs to be used in calculation (new classifier)

Default: 15

--impute_value

non-numerical float values in matrix will be replaced by this value (new classifier)

Default: -1.0

--alternative_classifier

pass custom classifier, needs to implement warm_start (new classifier)

--use_cleanlab

use Confident Learning with the cleanlab library (new classifier)

Default: False

--version

show program’s version number and exit

hicTADClassifier
Description

Uses Supervised Learning to call TAD boundaries. One or multiple HiC-Matrices can be passed, from which a BED file will be produced containing the predicted boundary positions. By default, a EasyEnsembleClassifier as described in Liu et al.: “Exploratory Undersampling for Class-Imbalance Learning” will be used to call TADs. Internally this classifier relies on Resampling, Boosting and Bagging. Passed matrices will be range normalized by default. Alternatively, obs/exp normalization can be used. Currently, only classifiers for 10kb resolution are implemented. For building own classifiers or tune existing ones, hicTrainClassifier can be used and passed with the saved_classifer argument. A simple usage example can be seen here:

$ hicTADClassifier -m my_matrix.cool -o predictions -n range

usage: hicTADClassifier --matrices MATRICES [MATRICES ...] --out_file OUT_FILE
                        [OUT_FILE ...]
                        [--normalization_method {obs_exp,range}]
                        [--saved_classifier SAVED_CLASSIFIER]
                        [--unselect_border_cases] [--threads THREADS]
                        [--chromosomes CHROMOSOMES [CHROMOSOMES ...]] [--help]
                        [--version]
Required arguments
--matrices, -m

HiC-Matrix file or list of files for input. Only COOLER files are supported!

--out_file, -o

output file path for predictions

Optional arguments
--normalization_method, -n

Possible choices: obs_exp, range

set the normalization mode, with which the passed matrices will be normalized. If not set, matrices will be range normalized

Default: “range”

--saved_classifier

Default classifier are available for 10kb, 25kb, 50kb and 100kb resolution. Do not set this parameter to use the default models. Pass a self-trained classifier (from hicTrainTADClassifier) to load a non-default model.

--unselect_border_cases

set whether genes at the border of the matrices will not be predicted

Default: False

--threads, -t

number of threads used

Default: 4

--chromosomes

Chromosomes to include in the analysis. If not set, all chromosomes are included.

--version

show program’s version number and exit

Tools for Hi-C and TADs visualization

hicPlotMatrix

Creates a heatmap of a Hi-C matrix.

usage: hicPlotMatrix --matrix MATRIX --outFileName OUTFILENAME [--title TITLE]
                     [--scoreName SCORENAME] [--perChromosome]
                     [--clearMaskedBins]
                     [--chromosomeOrder CHROMOSOMEORDER [CHROMOSOMEORDER ...]]
                     [--region REGION] [--region2 REGION2] [--log1p] [--log]
                     [--colorMap COLORMAP] [--vMin VMIN] [--vMax VMAX]
                     [--dpi DPI] [--bigwig BIGWIG [BIGWIG ...]]
                     [--bigwigAdditionalVerticalAxis]
                     [--vMinBigwig VMINBIGWIG] [--vMaxBigwig VMAXBIGWIG]
                     [--flipBigwigSign]
                     [--scaleFactorBigwig SCALEFACTORBIGWIG]
                     [--fontsize FONTSIZE] [--rotationX ROTATIONX]
                     [--rotationY ROTATIONY]
                     [--increaseFigureWidth INCREASEFIGUREWIDTH]
                     [--increaseFigureHeight INCREASEFIGUREHEIGHT]
                     [--loops LOOPS]
                     [--loopLargeRegionsOperation {first,last,center}]
                     [--tads TADS] [--help] [--version]
Required arguments
--matrix, -m

Path of the Hi-C matrix to plot.

--outFileName, -out

File name to save the image.

Optional arguments
--title, -t

Plot title.

--scoreName, -s

Score name label for the heatmap legend.

--perChromosome

Instead of plotting the whole matrix, each chromosome is plotted next to the other. This parameter is not compatible with –region.

Default: False

--clearMaskedBins

If set, masked bins are removed from the matrix and the nearest bins are extended to cover the empty space instead of plotting black lines.

Default: False

--chromosomeOrder

Chromosomes and order in which the chromosomes should be plotted. This option overrides –region and –region2.

--region

Plot only this region. The format is chr:start-end The plotted region contains the main diagonal and is symmetric unless –region2 is given.

--region2

If given, then only the region defined by –region and –region2 is given. The format is the same as –region1.

--log1p

Plot the log1p of the matrix values.

Default: False

--log

Plot the MINUS log of the matrix values.

Default: False

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html (Default: “RdYlBu_r”).

Default: “RdYlBu_r”

--vMin

Minimum score value.

--vMax

Maximum score value.

--dpi

Resolution for the image in case theoutput is a raster graphics image (e.g png, jpg) (Default: 72).

Default: 72

--bigwig

Bigwig file to plot below the matrix. This can for example be used to visualize A/B compartments or ChIP-seq data.

--bigwigAdditionalVerticalAxis

Add an additional axis to determine the values of a bigwig file in 2D better.

Default: False

--vMinBigwig

Minimum score value for bigwig.

--vMaxBigwig

Maximum score value for bigwig

--flipBigwigSign

The sign of the bigwig values are flipped. Useful if hicPCA gives inverted values.

Default: False

--scaleFactorBigwig

Scale the values of a bigwig file by the given factor (Default: 1.0).

Default: 1.0

--fontsize

Fontsize in the plot for x and y axis (Default: 10).

Default: 10

--rotationX

Rotation in degrees for the labels of x axis (Default: 0).

Default: 0

--rotationY

Rotation in degrees for the labels of y axis (Default: 0).

Default: 0

--increaseFigureWidth

Plotting additional bigwig tracks can cause compression in x or y direction of the heatmap. Set this factor to a value unequal to 0 to correct this (Default: 0.5).

Default: 0.5

--increaseFigureHeight

Plotting additional bigwig tracks can cause compression in x or y direction of the heatmap. Set this factor to a value unequal to 0 to correct this (Default: 0.5).

Default: 0.5

--loops

Bedgraph file to plot detected long range contacts from hicDetectLoops.

--loopLargeRegionsOperation

Possible choices: first, last, center

If a given coordinate in the bed file is larger than a bin of the input matrix, by default only the first bin is taken into account. However there are more possibilities to handel such a case. Users can ask for the last bin or for center of the region.

Default: “first”

--tads

Bedgraph file to plot detected tads

--version

show program’s version number and exit

Details

hicplotMatrix takes a Hi-C matrix and plots the interactions of all or some chromosomes.

Examples

Here’s an example of Hi-C data from wild-type D. melanogaster embryos.

_images/hicPlotMatrix.png

This plot shows all contacts of a Hi-C matrix. Its bins were merged into 25 kb bins using hicMergeMatrixBins. Alternatively, chromosomes can be plotted separately:

_images/hicPlotMatrix_perChromosome.png
$ hicPlotMatrix -m Dmel.h5 -o hicPlotMatrix.png \
-t 'D.melanogaster (--perChromosome)' --log1p \
--clearMaskedBins --chromosomeOrder 2L 2R 3L 3R X --perChromosome
hicPlotTADs
Description

For parameter options please see the pyGenomeTracks documentation.

Usage example

The hicPlotTADs output is similar to a genome browser screenshot that besides the usual genes and score data (like bigwig or bedgraph files) also contains Hi-C data. The plot is composed of tracks that need to be specified in a configuration file. Once the track file is ready, hicPlotTADs can be used as follows:

$ hicPlotTADs --tracks tracks.ini --region chrX:99,974,316-101,359,967 \
-t 'Marks et. al. TADs on X' -o tads.pdf
_images/marks_et-al_TADs.png
Configuration file template

The following is a template for the configuration file which is based on .ini configuration files. Each track is defined by a section header (for example [hic track]), followed by parameters specific to the section as color, title, etc. For details please see the documentation of pyGenomeTracks.

$ hicPlotTADs --tracks hic_track.ini -o hic_track.png --region chrX:2500000-3500000
[x-axis]
where = top

[hic matrix]
file = hic_data.h5
title = Hi-C data
# depth is the maximum distance plotted in bp. In Hi-C tracks
# the height of the track is calculated based on the depth such
# that the matrix does not look deformed
depth = 300000
transform = log1p
file_type = hic_matrix

[tads]
file = domains.bed
file_type = domains
border_color = black
color = none
# the tads are overlay over the hic-matrix
# the share-y options sets the y-axis to be shared
# between the Hi-C matrix and the TADs.
overlay_previous = share-y

[spacer]

[bigwig file test]
file = bigwig.bw
# height of the track in cm (optional value)
height = 4
title = ChIP-seq
min_value = 0
max_value = 30
_images/hic_track.png
hicPlotViewpoint

Plots the number of interactions around a given reference point in a region.

usage: hicPlotViewpoint --matrix MATRIX [MATRIX ...] --region REGION
                        --outFileName OUTFILENAME --referencePoint
                        REFERENCEPOINT [--chromosome CHROMOSOME]
                        [--interactionOutFileName INTERACTIONOUTFILENAME]
                        [--dpi DPI] [--version] [--help]
Required arguments
--matrix, -m

Hi-C matrix to plot.

--region

The format is chr:start-end.

--outFileName, -o

File name of the image to save.

--referencePoint, -rp

Reference point. Needs to be in the format: ‘chr:100’ for a single reference point or ‘chr:100-200’ for a reference region.

Optional arguments
--chromosome, -C

Optional parameter: Only show results for this chromosome.

--interactionOutFileName, -i

Optional parameter: If set, a bedgraph file with all interaction will be created.

--dpi

Optional parameter: Resolution for the image in case theouput is a raster graphics image (e.g png, jpg) (Default: 300).

Default: 300

--version

show program’s version number and exit

hicAggregateContacts
Background

hicAggregateContacts is a tool that allows plotting of aggregated Hi-C sub-matrices of a specified list of positions. Positions of interest can for example be binding sites of a specific protein that were determined by ChIP-seq or genetic elements as transcription start sites of active genes.

Description

Takes a list of positions in the Hi-C matrix and makes a pooled image.

usage: hicAggregateContacts --matrix MATRIX --outFileName OUTFILENAME --BED
                            BED --mode {inter-chr,intra-chr,all}
                            [--range RANGE] [--row_wise] [--BED2 BED2]
                            [--numberOfBins NUMBEROFBINS]
                            [--transform {total-counts,z-score,obs/exp,none}]
                            [--operationType {sum,mean,median}] [--perChr]
                            [--largeRegionsOperation {first,last,center}]
                            [--help] [--version]
                            [--outFilePrefixMatrix OUTFILEPREFIXMATRIX]
                            [--outFileContactPairs OUTFILECONTACTPAIRS]
                            [--outFileObsExp OUTFILEOBSEXP]
                            [--diagnosticHeatmapFile DIAGNOSTICHEATMAPFILE]
                            [--kmeans KMEANS] [--hclust HCLUST]
                            [--spectral SPECTRAL]
                            [--howToCluster {full,center,diagonal}]
                            [--keep_outlier] [--max_deviation MAX_DEVIATION]
                            [--chromosomes CHROMOSOMES [CHROMOSOMES ...]]
                            [--colorMap COLORMAP] [--plotType {2d,3d}]
                            [--vMin VMIN] [--vMax VMAX] [--dpi DPI]
Required arguments
--matrix, -m

Path of the Hi-C matrix to plot.

--outFileName, -out

File name to save the image.

--BED

Interactions between regions in this BED file are plotted.

--mode

Possible choices: inter-chr, intra-chr, all

Optional arguments
--range

Range of contacts that will be considered for plotting the aggregate contacts in bp with the format low_range:high_range for example 1000000:20000000. The range should start at contacts larger than TAD size to reduce background interactions. This will be ignored if inter-chromosomal contacts are of interest.

--row_wise

If given,the insteractions between each row of the BED file and its corresponding row of the BED2 file are computed. If intra-chromosomal contacts are computed, the rows with different chromosomes are ignored. If inter-chromosomal, the rows with same chromosomes are ignored. It keeps all the rows if all.

Default: False

--BED2

Optional second BED file. Interactions between regions in first and second BED file are plotted.

--numberOfBins

Number of bins to include in the submatrix. The bed regions will be centered between half number of bins and the other half number of bins (Default: 51).

Default: 51

--transform

Possible choices: total-counts, z-score, obs/exp, none

Type of transformation for the matrix. The options are “none”, “total-counts”, “z-score” and “obs/exp”. If total counts are selected, the sub-matrix values are divided by the total counts for normalization. If z-score or obs/exp are selected, the Hi-C matrix is converted into a z-score or observed / expected matrix (Default: “none”).

Default: “none”

--operationType

Possible choices: sum, mean, median

Type of the operation to be applied to summerize the submatrices into a single matrix. Options are sum, mean and median. (Default: “median”)

Default: “median”

--perChr

if set, it generates a plot per chromosome. It is only affected if intra-chromosomal contacts are of interest.

Default: False

--largeRegionsOperation

Possible choices: first, last, center

If a given coordinate in the bed file is larger than a bin of the input matrix, by default only the first bin is taken into account. However there are more possibilities to handel such a case. Users can ask for the last bin or for center of the region. As an example if a region falls into bins [4,5,6] and –numberOfBins = 2 then if first, bins [3,4,5] are kept. If last: [5,6,7] and if center: [4,5,6].

Default: “first”

--version

show program’s version number and exit

--dpi

Optional parameter: Resolution for the image in case theoutput is a raster graphics image (e.g png, jpg) (Default: 300).

Default: 300

Output options
--outFilePrefixMatrix

If this option is set, the values underlying the output matrix will be saved to tab-delimited tables (one per chromosome) using the indicated prefix, for example TSS_to_TSS_chrX.tab. If clustering is performed, the values are saved including the cluster_id in the file TSS_to_TSS_chrX_cluster_1.tab.

--outFileContactPairs

Output file prefix. If this option is set, the position of the contact positions are saved as (chrom1, start1, end1, chrom2, start2, end2) where chrom_n, start_n, end_n correspond to the pair of positions used to compute the submatrix. The data is saved per chromosome and per cluster separately (one file each). The format is {prefix}_{chrom}_{cluster_id}.tab. If no clusters were computed, then only one file per chromosome is produced.

--outFileObsExp

writes the obs/exp matrix to a file, if –transform=obs/exp.

--diagnosticHeatmapFile

If given, a heatmap (per chromosome) is saved. Each row in the heatmap contains thediagonal of each of the submatrices centered on the bed file. This file is useful to get an idea of the values that are used for the aggregate matrix and to determine the fraction of sub-matrices that are aggregated that may have an enrichment at the center.

Clustering options
--kmeans

Number of clusters to compute. When this option is set, the submatrices are split into clusters (per chromosome) using the k-means algorithm.

--hclust

Number of clusters to compute (per chromosome). When this option is set, the matrix is split into clusters using the hierarchical clustering algorithm, using “ward linkage”. –hclust could be very slow if you have >1000 submatrices per chromosome. In those cases, you might prefer –kmeans

--spectral

Number of clusters to compute (per chromosome). When this option is set, the matrix is split into clusters using the spectral clustering algorithm.

--howToCluster

Possible choices: full, center, diagonal

Options are “full”, “center” and “diagonal”. The full clustering takes all values of each submatrix for clustering. “center”, takes only a square of length 3x3 from each submatrix and uses only this values for clustering. With the “diagonal” option the clustering is only carried out based on the submatrix diagonal (representing values at the same distance to each other) (Default: “full”).

Default: “full”

--keep_outlier

keep outliers before clustering. They will be removed by default. Note that removing outliers is also applied to the case where the number of clusters is one (e.g. no clustering is required) unless keep_outlier is set.

Default: False

--max_deviation

max deviation from mean to be determined as outlier.

Default: 2

Plotting options
--chromosomes, -C

List of chromosomes to plot.

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html (Default: “RdYlBu_r”).

Default: “RdYlBu_r”

--plotType

Possible choices: 2d, 3d

Plot type (Default: “2d”).

Default: “2d”

--vMin

Minimum value of the plotted score.

--vMax

Maximum value of the plotted score.

Usage example

Below, you can find an example of an aggregate Hi-C matrix obtained from Drosophila melanogaster Hi-C data. The interactions are plotted at binding sites of a protein that were determined by ChIP-seq. We plot sub-matrices of 30 bins (1.5 kb bin size, 45 kb in total). The regions specified in the BED file will be centered between half number of bins and the other half number of bins.The considered range is 300-1000 kb. The range should be adjusted and only contain contacts larger than TAD size to reduce background interactions.

$ hicAggregateContacts --matrix Dmel.h5 --BED ChIP-seq-peaks.bed \
--outFileName Dmel_aggregate_Contacts --vMin 0.8 --vMax 2.2 \
--range 300000:1000000 --numberOfBins 30 --chromosomes X \
--operationType mean --transform obs/exp --mode intra-chr
_images/hicAggregateContacts.png

This example was calculated using mean interactions of an observed vs expected transformed Hi-C matrix. Additional options for the matrix transformation are total-counts or z-score. Aggregate contacts can be plotted in 2D or 3D.

hicPlotAverageRegions

hicPlotAverage regions plots the data computed by hicAverageRegions. It shows the summed up and averaged regions around all given reference points. This tool is useful to plot differences at certain reference points as for example TAD boundaries between samples.

usage: hicPlotAverageRegions --matrix MATRIX --outputFile OUTPUTFILE [--log1p]
                             [--log] [--colorMap COLORMAP] [--vMin VMIN]
                             [--vMax VMAX] [--dpi DPI] [--help] [--version]
Required arguments
--matrix, -m

The averaged regions file computed by hicAverageRegions (npz file).

--outputFile, -o

The averaged regions plot.

Optional arguments
--log1p

Plot log1p of the matrix values.

Default: False

--log

Plot log of the matrix values.

Default: False

--colorMap

Color map to use for the heatmap. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html (Default: “hot_r”).

Default: “hot_r”

--vMin

Minimum score value.

--vMax

Maximum score value.

--dpi

Resolution of image ifouput is a raster graphics image (e.g png, jpg) (Default: 300).

Default: 300

--version

show program’s version number and exit

See hicAverageRegions for an example usage.

Hi-C contact matrix handling

hicConvertFormat

Conversion of Hi-C matrices of different file formats. We support the conversion of hic to cool format via hic2cool, and homer, HicPro, h5 and cool format to h5, cool, homer or ginteractions format. Moreover, hicConvertFormat accepts multiple input files from one format with different resolutions and creates a mcool file. Each original file is stored under the path e.g. ::/resolutions/10000. A batch computation is possible, the number of input files and output files needs to match, all input files need to be of the same format type and all output files too. For input and output of cooler files special options are available, for all other formats they will be ignored.HiCPro file format needs an additional bed file as input.

usage: hicConvertFormat --matrices MATRICES [MATRICES ...] --outFileName
                        OUTFILENAME [OUTFILENAME ...] --inputFormat
                        {h5,cool,hic,homer,hicpro,2D-text} --outputFormat
                        {cool,h5,homer,ginteractions,mcool,hicpro}
                        [--correction_name CORRECTION_NAME]
                        [--correction_division] [--store_applied_correction]
                        [--chromosome CHROMOSOME] [--enforce_integer]
                        [--load_raw_values]
                        [--resolutions RESOLUTIONS [RESOLUTIONS ...]] [--help]
                        [--chromosomeSizes txt file] [--version]
                        [--bedFileHicpro BEDFILEHICPRO [BEDFILEHICPRO ...]]
Required arguments
--matrices, -m

input file(s). Could be one or many files.

--outFileName, -o

File name to save the exported matrix.

--inputFormat

Possible choices: h5, cool, hic, homer, hicpro, 2D-text

File format of the input matrix file. The following options are available: h5 (native HiCExplorer format based on hdf5 storage format), cool, hic, homer, hicpro, 2D-text.

--outputFormat

Possible choices: cool, h5, homer, ginteractions, mcool, hicpro

Output format. The following options are available: h5 (native HiCExplorer format based on hdf5 storage format). cool, ginteractions, homer, mcool and hicpro (Default: “cool”).

Default: “cool”

Optional arguments
--correction_name

Name of the column which stores the correction factors. The information about the column names can be figured out with the tool hicInfo. Option only for cool input files (Default: “weight”).

Default: “weight”

--correction_division

If set, division is applied for correction. Default is a multiplication. Option only for cool input files.

Default: False

--store_applied_correction

Store the applied correction and do not set correction factors. Option only for cool input files.

Default: False

--chromosome

Load only one chromosome. Option only for cool input files.

--enforce_integer

Enforce datatype of counts to integer. Option only for cool input files.

Default: False

--load_raw_values

Load only ‘count’ data and do not apply a correction. Option only for cool input files.

Default: False

--resolutions, -r

List of resolutions that should be added.

--chromosomeSizes, -cs

This option is for the input format 2D-text only and will be ignored else.File with the chromosome sizes for your genome. A tab-delimited two column layout “chr_name size” is expectedPlease consider that this option causes that only reads are considered which are on the listed chromosomes.Use this option to guarantee fixed sizes. An example file is available via UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/dm3.chrom.sizes

--version

show program’s version number and exit

--bedFileHicpro, -bf

Bed file(s) of hicpro file format.

Background

To reproduce analyses and to compare and use different Hi-C analysis software, conversion between interaction matrix file formats is crucial. However, most Hi-C softwares are only supporting their own data format which makes the exchange difficult. HiCExplorer supports a range of interaction matrices, both for import and for export.

Import:
  • hic

  • cool

  • h5

  • homer

  • HicPro

Export:
  • cool

  • mcool

  • h5

  • homer

  • ginteractions

With HiCExplorer version 2.2 hicConvertFormat and hicAdjustMatrix replace hicExport from HiCExplorer 2.1 and older versions.

Usage example
hic2cool

HiCExplorer uses the library hic2cool to convert .hic interaction matrix files to the cool format. Usually .hic files have the three correction factors KR, VC or VC_SQRT; however these cannot be applied natively by HiCExplorer tools because HiCExplorer expects the correction values to be stored in the column weight. To work with corrected data the correction factors need to applied separately, see section cool to cool.

The following example will convert a hic file which contains the resolution of 1000 to a cool file with 10kb resolution. The desired resolution needs to be existing in the hic file. If no resolution parameter is defined, a mcool file with all available resolutions is created.

$ hicConvertFormat -m matrix.hic --inputFormat hic --outputFormat cool -o matrix.cool --resolutions 10000

It is only possible to convert from hic to cool format, no other formats are supported.

cool to cool

The cool file format is developed and maintained by the Mirny lab and allows to access interaction matrices in an easy to use data format.

The cool data format allows to use the following options:

  • correction_name: In case correction factors are not stored in ‘weight’ the correct column name can be defined using this parameter and the resulting matrix will store the values in ‘weight’.

  • correction_division: Correction factors can be applied by a multiplication or a division. The default behaviour is to use the multiplication, in case the correction factors are inverted, set this parameter.

  • store_applied_correction: Set this parameter if correction factors should be applied on the data and should be written back to colum ‘counts’ in the corrected form and not as raw. Default: not set.

  • chromosomes: Define a list of chromosomes which should be included in the output matrix. All chromosomes which are not defined are not part of the new matrix. This parameter can speed up the processing especially if only one chromosome is used.

  • enforce_integer: Raw interaction data is stored as integers, after the correction is applied the data is a float. Set a this parameter to enforce integer values in the new matrix.

  • load_raw_values: Set this parameter if the interaction data should not be loaded with the correction factors.

Example usage
$ hicConvertFormat -m matrix.cool --inputFormat cool --outputFormat cool -o matrix.cool --correction_name KR
Homer

Homer is a software for ‘motif discovery and next generation sequencing analysis’ and supports Hi-C. HiCExplorer is able to read and write the interaction matrix from Homer. Homer stores the interaction matrix in a simple text file as a dense matrix. To write large matrices in Homer format needs a lot of space and can take a few ours to days.

Example usage
$ hicConvertFormat -m matrix.homer --inputFormat homer --outputFormat cool -o matrix.cool
Hic-Pro

HiC-Pro file format needs an additional bed file as input:

Example usage
$ hicConvertFormat -m matrix.hicpro --bedFileHicpro hicpro.bed --inputFormat hicpro --outputFormat cool -o matrix.cool
Create a mcool file

With HiCExplorer it is possible to create a multiple cool (mcool) file. These mcool files can be used e.g. with HiGlass.

To create an mcool file, use as input either one matrix in one of the supported read formats and define the desired resolutions or define multiple input matrices. In the second case, the matrices should all have different resolutions.

Example usage

The resolutions need to be a multiple of the input matrix i.e. matrix with 10kb, 20kb and 30kb are possible but not 35kb.

$ hicConvertFormat -m matrix.cool --inputFormat cool --outputFormat mcool
   -o multi_matrix.mcool --resolutions 20000 40000 70000 120000 500000
$ hicConvertFormat -m matrix10kb.cool matrix20kb.cool matrix30kb.cool
    --inputFormat cool --outputFormat mcool -o multi_matrix.mcool

The mcool matrix contains the individual matrices as follows:

multi_matrix.mcool::/resolutions/10000
multi_matrix.mcool::/resolutions/40000
multi_matrix.mcool::/resolutions/70000
multi_matrix.mcool::/resolutions/120000
multi_matrix.mcool::/resolutions/500000
hicAdjustMatrix

This tool adjusts hic matrices by keeping, removing or masking a given list of regions or chromosmes.

usage: hicAdjustMatrix --matrix MATRIX --outFileName OUTFILENAME
                       [--chromosomes CHROMOSOMES [CHROMOSOMES ...] |
                       --regions REGIONS | --maskBadRegions MASKBADREGIONS]
                       [--action {keep,remove,mask}]
                       [--interIntraHandling {None,inter,intra}] [--help]
                       [--version]
Named Arguments
--chromosomes, -c

List of chromosomes to keep/remove.

--regions, -r

BED file which stores a list of regions to keep/remove.

--maskBadRegions, -mbr

Bad regions are identified and masked.

Required arguments
--matrix, -m

The Hi-C matrix to adjust. HiCExplorer supports the following file formats: h5 (native HiCExplorer format) and cool.

--outFileName, -o

File name to save the adjusted matrix.

Optional arguments
--action, -a

Possible choices: keep, remove, mask

Keep, remove or mask the list of specified chromosomes/regions. keep/remove: These options keep/remove bins of matrix by deleting them. This may cause issue plotting the matrix if several parts of a single chromosome are going to be deleted. In that case, one may consider using the mask option (Default: “keep”).

Default: “keep”

--interIntraHandling, -iih

Possible choices: None, inter, intra

Remove the inter- or intra-chromosomal contacts of the given chromosomes. (Default: None).

--version

show program’s version number and exit

hicAdjustMatrix can mask, remove or keep defined regions from a BED file or given chromosomes. This can be useful to create a smaller Hi-C matrix of a single chromosome for e.g. testing purposes or to remove repetitive chromosome ends before calculating A/B compartments using hicPCA.

Example usages
$ hicAdjustMatrix -m matrix.cool --action keep --chromosomes chr1 -o matrix_chr1.cool
$ hicAdjustMatrix -m matrix.cool --action mask --regions mask_regions.bed -o matrix_masked.cool

mask_regions.bed

chr1    10  30
chr1    50  300

Capture Hi-C analysis

chicQualityControl

Computes the sparsity of each viewpoint to determine the quality. A viewpoint is considered to be of bad quality if it is too sparse i.e. if there are too many locations with no interactions recorded.

This script creates three output files: a plot with the sparsity distribution per matrix, a plot with the sparsity distribution as histograms and a filtered reference points file.

An example usage is:

$ chicQualityControl -m matrix1.cool matrix2.cool -rp referencePointsFile.bed –range 20000 40000 –sparsity 0.01 -o referencePointFile_QC_passed.bed

usage: chicQualityControl --matrices MATRICES [MATRICES ...] --referencePoints
                          REFERENCEPOINTS --sparsity SPARSITY
                          [--outFileName OUTFILENAME]
                          [--outFileNameHistogram OUTFILENAMEHISTOGRAM]
                          [--outFileNameSparsity OUTFILENAMESPARSITY]
                          [--threads THREADS] [--fixateRange FIXATERANGE]
                          [--dpi DPI] [--help] [--version]
Required arguments
--matrices, -m

The input matrices to apply the QC on.

--referencePoints, -rp

Bed file contains all reference points which are checked for a sufficient number of interactions.

--sparsity, -s

Viewpoints with a sparsity less than the value given are considered of bad quality. If multiple matrices are given, the viewpoint is removed as soon as it is of bad quality in at least one matrix.

Optional arguments
--outFileName, -o

The output file name of the passed reference points. Used as prefix for the plots as well (Default: “new_referencepoints.bed”).

Default: “new_referencepoints.bed”

--outFileNameHistogram, -oh

The output file for the histogram plot (Default: “histogram.png”).

Default: “histogram.png”

--outFileNameSparsity, -os

The output file for the sparsity distribution plot (Default: “sparsity.png”).

Default: “sparsity.png”

--threads, -t

Number of threads (Default: 4).

Default: 4

--fixateRange, -fs

Fixate score of background model starting at distance x. E.g. all values greater than 500kb are set to the value of the 500kb bin (Default: 500000).

Default: 500000

--dpi

Optional parameter: Resolution for the image if theoutput is a raster graphics image (e.g png, jpg) (Default: 300).

Default: 300

--version

show program’s version number and exit

chicViewpointBackgroundModel

chicViewpointBackgroundModel computes a background model for all given samples with all reference points. For all relative distances to a reference point a negative binomial distribution is fitted. In addition, for each relative distance to a reference point the average value for this location is computed. Both background models are used, the first one for p-value and significance computation, the second one to filter out interactions with a smaller x-fold over the mean.

The background distributions are fixed at –fixateRange, i.e. all distances lower or higher than this value use the fixed background distribution.

An example usage is:

$ chicViewpointBackgroundModel –matrices matrix1.cool matrix2.cool matrix3.cool –referencePoints referencePointsFile.bed –range 20000 40000 –outFileName background_model.bed

usage: chicViewpointBackgroundModel --matrices MATRICES [MATRICES ...]
                                    --referencePoints REFERENCEPOINTS
                                    [--averageContactBin AVERAGECONTACTBIN]
                                    [--truncateZeros]
                                    [--outFileName OUTFILENAME]
                                    [--threads THREADS]
                                    [--fixateRange FIXATERANGE] [--help]
                                    [--version]
Required arguments
--matrices, -m

The input matrices (samples) to build the background model on.

--referencePoints, -rp

Bed file contains all reference points which should be used to build the background model.

Optional arguments
--averageContactBin

Average the contacts of n bins via a sliding window approach (Default: 5).

Default: 5

--truncateZeros, -tz

Truncates the zeros before the distributions are fitted. Use it in case you observe an over dispersion.

Default: False

--outFileName, -o

The name of the background model file (Default: “background_model.txt”).

Default: “background_model.txt”

--threads, -t

Number of threads (uses the python multiprocessing module) (Default: 4).

Default: 4

--fixateRange, -fs

Fixate score of backgroundmodel starting at distance x. E.g. all values greater 500kb are set to the value of the 500kb bin (Default: 500000).

Default: 500000

--version

show program’s version number and exit

chicViewpoint

Computes per input matrix all viewpoints which are defined in the reference points file.

usage: chicViewpoint --matrices MATRICES [MATRICES ...] --range RANGE RANGE
                     --referencePoints REFERENCEPOINTS --backgroundModelFile
                     BACKGROUNDMODELFILE [--outFileName OUTFILENAME]
                     [--threads THREADS]
                     [--averageContactBin AVERAGECONTACTBIN]
                     [--fixateRange FIXATERANGE] [--help] [--version]
Required arguments
--matrices, -m

Path to the Hi-C matrices which store the captured Hi-C data per sample.

--range

Defines the region upstream and downstream of a reference point which should be considered in the analysis. Please remember to use the same fixate range setting as for the background model computation and that distances of the range larger than the fixate range use the background model of those.Format is –region upstream downstream

--referencePoints, -rp

Reference point file. Needs to be in the format: ‘chr 100’ for a single reference point or ‘chr 100 200’ for a reference region and with a single reference point per line

--backgroundModelFile, -bmf

path to the background file computed by chicViewpointBackgroundModel

--outFileName, -o

This hdf5 file contains all created viewpoint files.

Default: “chic_files.hdf5”

Optional arguments
--threads, -t

Number of threads (uses the python multiprocessing module) (Default: 4).

Default: 4

--averageContactBin

Average the contacts of n bins via a sliding window approach to smooth the values and be less sensitive for outliers (Default: 5).

Default: 5

--fixateRange, -fs

Fixate range of background model starting at distance x. E.g. all values greater 500kb are set to the value of the 500kb bin (Default: 500000).

Default: 500000

--version

show program’s version number and exit

chicSignificantInteractions

Per viewpoint the significant interactions are detected based on the background model. For each viewpoint file, an output file is created with all recorded significant interactions and a target file. The target file is especially useful in the batch mode context; for two consecutive listed control and treatment viewpoints it merges the significant interactions which can then be used to test for a differential interaction scheme.

chicSignificantInteractions supports two modes to detect significant interactions, either by an x-fold over the average background or a loose p-value. In both cases neighboring significant peaks are merged together and an additional p-value is computed based on the sum of interactions for this neighborhood. Only interactions with a higher p-value (as specified by the threshold –pValue) are accepted as a significant interaction.

usage: chicSignificantInteractions --interactionFile INTERACTIONFILE --pValue
                                   PVALUE
                                   [--xFoldBackground XFOLDBACKGROUND | --loosePValue LOOSEPVALUE]
                                   --backgroundModelFile BACKGROUNDMODELFILE
                                   --range RANGE RANGE
                                   [--outFileNameSignificant OUTFILENAMESIGNIFICANT]
                                   [--outFileNameTarget OUTFILENAMETARGET]
                                   [--combinationMode {dual,single}]
                                   [--threads THREADS] [--truncateZeroPvalues]
                                   [--fixateRange FIXATERANGE]
                                   [--peakInteractionsThreshold PEAKINTERACTIONSTHRESHOLD]
                                   [--help] [--version]
Named Arguments
--xFoldBackground, -xf

Filter x-fold over background. Used to merge neighboring bins with a broader peak but less significant interactions to a single peak with high significance. Used only for pValue option.

--loosePValue, -lp

loose p-value threshold to filter target regions in a first round. Used to merge neighboring bins with a broader peak but less significant interactions to a single peak with high significance. Used only for pValue option.

Required arguments
--interactionFile, -if

path to the interaction file (HDF5) which should be used for aggregation of the statistics.

--pValue, -p

p-value threshold to filter target regions for inclusion in differential analysis.

--backgroundModelFile, -bmf

path to the background file.

--range

Defines the region upstream and downstream of a reference point which should be included. Format is –region upstream downstream, e.g. –region 500000 500000 plots 500kb up- and 500kb downstream. This value should not exceed the range used in the other chic-tools.

Optional arguments
--outFileNameSignificant, -os

File name suffix to save the results; prefix is the input file name (Default: “significantInteractions.hdf5”).

Default: “significantInteractions.hdf5”

--outFileNameTarget, -ot

The file to store the target data (Default: “targetFile.hdf5”).

Default: “targetFile.hdf5”

--combinationMode, -cm

Possible choices: dual, single

This option defines how the interaction data should be computed and combined: dual: Combines as follows: [[matrix1_gene1, matrix2_gene1], [matrix2_gene1, matrix3_gene1],[matrix1_gene2, matrix2_gene2], …]single: Combines as follows: [matrix1_gene1, matrix1_gene2, matrix2_gene1, …], (Default: “dual”).

Default: “dual”

--threads, -t

Number of threads (uses the python multiprocessing module) (Default: 4).

Default: 4

--truncateZeroPvalues, -tzpv

Sets all p-values which are equal to zero to one. This has the effect that the associated positions are not part of the significance decision.

Default: False

--fixateRange, -fs

Fixate range of backgroundmodel starting at distance x. E.g. all values greater than 500kb are set to the value of the 500kb bin (Default: 500000).

Default: 500000

--peakInteractionsThreshold, -pit

The minimum number of interactions a detected peak needs to have to be considered (Default: 5).

Default: 5

--version

show program’s version number and exit

chicAggregateStatistic

chicAggregateStatistic is a preprocessing tool for chicDifferentialTest. It takes two consecutive viewpoint files and one target file and creates one file containing all locations which should be tested for differential interactions. Either one target file for two consecutive viewpoint files or one target file for all viewpoints is accepted.

usage: chicAggregateStatistic --interactionFile INTERACTIONFILE
                              [--targetFile TARGETFILE]
                              [--outFileName OUTFILENAME] [--threads THREADS]
                              [--help] [--version]
Required arguments
--interactionFile, -if

path to the interaction files which should be used for aggregation of the statistics.

--targetFile, -tf

path to the target files which contains the target regions to prepare data for differential analysis. This is either the target file in the hdf format created by chicSignificantInteractions or a regular, three column bed file.

Optional arguments
--outFileName, -o

File name to save the result (Default: “aggregate_target.hdf”).

Default: “aggregate_target.hdf”

--threads, -t

Number of threads (uses the python multiprocessing module)ist (Default: 4).

Default: 4

--version

show program’s version number and exit

chicDifferentialTest

chicDifferentialTest tests if two locations under consideration of the reference point have a different interaction count. For this either Fisher’s test or the chi2 contingency test can be used. The file that is accepted for this test can be created with chicAggregateStatistic. H0 assumes the interactions are not different. Therefore the differential interaction counts are all where H0 was rejected.

usage: chicDifferentialTest --aggregatedFile AGGREGATEDFILE --alpha ALPHA
                            [--outFileName OUTFILENAME]
                            [--statisticTest {fisher,chi2}]
                            [--threads THREADS] [--help] [--version]
Required arguments
--aggregatedFile, -af

path to the aggregated files which should be used for the differential test.

--alpha, -a

define a significance level (alpha) for accepting samples

Optional arguments
--outFileName, -o

Output file for the differential results (Default: “differentialResults.hdf5”).

Default: “differentialResults.hdf5”

--statisticTest

Possible choices: fisher, chi2

Type of test used: fisher’s exact test or chi2 contingency (Default: “fisher”).

Default: “fisher”

--threads, -t

Number of threads (uses the python multiprocessing module) (Default: 4).

Default: 4

--version

show program’s version number and exit

chicPlotViewpoint

chicPlotViewpoint plots one or many viewpoints with the average background model and the computed p-value per sample. In addition, it can highlight differential interactions of two samples and/or significant regions.

usage: chicPlotViewpoint --interactionFile INTERACTIONFILE --range RANGE RANGE
                         [--backgroundModelFile BACKGROUNDMODELFILE]
                         [--differentialTestResult DIFFERENTIALTESTRESULT]
                         [--significantInteractions SIGNIFICANTINTERACTIONS]
                         [--plotSignificantInteractions]
                         [--outFileName OUTFILENAME]
                         [--outputFormat OUTPUTFORMAT] [--dpi DPI]
                         [--combinationMode {dual,single,allGenes,oneGene}]
                         [--combinationName COMBINATIONNAME]
                         [--colorMapPvalue COLORMAPPVALUE]
                         [--maxPValue MAXPVALUE] [--minPValue MINPVALUE]
                         [--pValue]
                         [--pValueSignificanceLevels PVALUESIGNIFICANCELEVELS [PVALUESIGNIFICANCELEVELS ...]]
                         [--xFold XFOLD] [--truncateZeroPvalues]
                         [--colorList COLORLIST [COLORLIST ...]]
                         [--threads THREADS] [--help] [--version]
Required arguments
--interactionFile, -if

path to the interaction files which should be used for plotting

--range

Defines the region upstream and downstream of a reference point which should be included. Format is –region upstream downstream, e.g.: –region 500000 500000 plots 500kb up- and 500kb downstream. This value should not exceed the range used in the other chic-tools.

Optional arguments
--backgroundModelFile, -bmf

path to the background file which should be used for plotting

--differentialTestResult, -dif

Path to the H0 rejected files to highlight the regions in the plot.

--significantInteractions, -si

Path to the files with detected significant interactions to highlight the regions in the plot.

--plotSignificantInteractions, -psi

Highlights the significant interactions in the plot itself. If not set, only the p-values are updated

Default: False

--outFileName, -o

Output tar.gz of the files (Default: “plots.tar.gz”).

Default: “plots.tar.gz”

--outputFormat, -format

Output format of the plot (Default: “png”).

Default: “png”

--dpi

Optional parameter: Resolution for the image, ifoutput is a raster graphics image (e.g png, jpg) (Default: 300).

Default: 300

--combinationMode, -cm

Possible choices: dual, single, allGenes, oneGene

This option defines how the interaction data should be computed and combined: dual: Combines as follows: [[matrix1_gene1, matrix2_gene1], [matrix2_gene1, matrix3_gene1],[matrix1_gene2, matrix2_gene2], …]single: Combines as follows: [matrix1_gene1, matrix1_gene2, matrix2_gene1, …], allGenes: Combines as follows: [[matrix1_gene1, matrix2_gene1, matrix2_gene1], [matrix1_gene2, matrix2_gene2, matrix3_gene2], …]oneGene: Computes all data of one gene, please specify ‘–’. If a gene is not unique, each viewpoint is treated independently. (Default: “dual”).

Default: “dual”

--combinationName, -cn

Gene name or file name for modes ‘oneGene’ or ‘file’ of parameter ‘–combinationMode’ (Default: None).

--colorMapPvalue

Color map to use for the p-value. Available values can be seen here: http://matplotlib.org/examples/color/colormaps_reference.html (Default: “RdYlBu”).

Default: “RdYlBu”

--maxPValue, -map

Maximal value for p-value. Values above this threshold are set to this value (Default: 0.1).

Default: 0.1

--minPValue, -mp

Minimal value for p-value. Values below this threshold are set to this value (Default: 0.0).

Default: 0.0

--pValue, -p

Plot p-values as a colorbar

Default: False

--pValueSignificanceLevels, -psl

Highlight the p-values by the defined significance levels.

--xFold, -xf

Plot x-fold region for the mean background.

--truncateZeroPvalues, -tzpv

Sets all p-values which are equal to zero to one.

Default: False

--colorList, -cl

Colorlist for the viewpoint lines (Default g b c m y k).

Default: [‘g’, ‘b’, ‘c’, ‘m’, ‘y’, ‘k’]

--threads, -t

Number of threads (uses the python multiprocessing module) (Default: 4).

Default: 4

--version

show program’s version number and exit

For single-cell Hi-C data analysis please use scHiCExplorer .

tool

type

input files

main output file(s)

application

hicFindRestSites

preprocessing

1 genome FASTA file

bed file with restriction site coordinates

Identifies the genomic locations of restriction sites

hicBuildMatrix

preprocessing

2 BAM/SAM files

hicMatrix object

Creates a Hi-C matrix using the aligned BAM files of the Hi-C sequencing reads

hicCorrectMatrix

preprocessing

hicMatrix object

normalized hicMatrix object

Uses iterative correction or Knight-Ruiz to remove biases from a Hi-C matrix

hicMergeMatrixBins

preprocessing

hicMatrix object

hicMatrix object

Merges consecutive bins on a Hi-C matrix to reduce resolution

hicSumMatrices

preprocessing

2 or more hicMatrix objects

hicMatrix object

Adds Hi-C matrices of the same size

hicNormalize

preprocessing

multiple Hi-C matrices

multiple Hi-C matrices

Normalize data to 0 to 1 range or to smallest total read count

hicCorrelate

analysis

2 or more hicMatrix objects

a heatmap/scatter plot

Computes and visualizes the correlation of Hi-C matrices

hicFindTADs

analysis

hicMatrix object

bedGraph file (TAD score), a boundaries.bed file, a domains.bed file (TADs)

Identifies Topologically Associating Domains (TADs)

hicMergeDomains

analysis

multiple TAD domain files

tad domain file with merged tad locations multiple files with plotted TAD relations

Merges detect TADs locations of different resolutions; hierarchical relation between TADs as multiple plots

hicDifferentialTAD

analysis

two Hi-C matrices one TAD domain file

two diff_tad files: accepted H0 and rejected H0. Similar to BED6

Identifies differential Topologically Associating Domains (TADs) between two Hi-C matrices

hicInterIntraTAD

analysis

Hi-C matrix, TAD boundaries

data file and a plot

Computes and extracts TAD data. Creates a plot for the inter/intra contact ratio.

hicTrainTADClassifier

analysis

Hi-C matrix, TAD boundaries

ML model for hicTADClassifier

Computes a ML model for hicTADClassifier.

hicTADClassifier

analysis

Hi-C matrix, (ML model)

predicted TAD boundaries

Computes TAD boundaries based on a ML model from hicTrainTADClassifier

hicPlotMatrix

visualization

hicMatrix object

a heatmap of Hi-C contacts

Plots a Hi-C matrix as a heatmap

hicPlotTADs

visualization

hicMatrix object, a config file

Hi-C contacts on a given region, along with other provided signal (bigWig) or regions (bed) file

Plots TADs as a track that can be combined with other tracks (genes, signal, interactions)

hicPlotDistVsCounts

visualization

hicMatrix object

log log plot of Hi-C contacts per distance

Quality control

hicConvertFormat

data integration

one/multiple Hi-C file formats

Hi-C matrices/outputs in several formats

Convert matrix to different formats

hicAdjustMatrix

data integration

one Hi-C file formats

Hi-C matrix

Removes, masks or keeps specified regions of a matrix

hicInfo

information

one or more hicMatrix objects

Screen info

Prints information about matrices, like size, maximum, minimum, bin size, etc.

hicPCA

analysis

one Hi-C matrix

bedgraph or bigwig file(s) for each eigenvector

Computes for A / B compartments the eigenvectors

hicTransform

analysis

one Hi-C matrix

Hi-C matrix

Computes a obs_exp matrix like Lieberman-Aiden (2009), a pearson correlation matrix and or a covariance matrix. These matrices can be used for plotting.

hicPlotViewpoint

visualization

one Hi-C matrix

A viewpoint plot

A plot with the interactions around a reference point or region.

hicQC

information

log files from hicBuildMatrix

A quality control report

Quality control of the created contact matrix.

hicQuickQC

information

2 BAM/SAM files

An estimated quality control report

Estimated quality report of the Hi-C data.

hicCompareMatrices

analysis

two Hi-C matrices

one Hi-C matrix

Applies diff, ratio or log2ratio on matrices to compare them.

hicAverageRegions

analysis

multiple Hi-C matrices

one npz object

Averages the given locations. Visualization with hicPlotAverageRegions

hicDetectLoops

analysis

one Hi-C matrices

bedgraph file with loop locations

Detects enriched regions. Visualization with hicPlotmatrix and –loop parameter.

hicValidateLocations

analysis

one loop, one protein peak file

bedgraph file with matched loop locations, one file with loop / protein statistics

Matches loop locations with protein peak positions

hicMergeLoops

analysis

multiple loop files

bedgraph file with merged loop locations

Merges detect loop locations of different resolutions

hicHyperoptDetectLoops

analysis

one Hi-C matrix, protein peaks

best parameter setting

Search for best parameter setting for hicDetectLoops

hicHyperoptDetectLoopsHiCCUPS

analysis

one Hi-C matrix, protein peaks

best parameter setting

Search for best parameter setting for Juicer’s HiCCUPS

hicCompartmentalization

visualization

one Hi-C interaction matrix one PCA bedgraph file

one image polarization plot

The global compartmentalization signal.

hicPlotAverageRegions

visualization

one npz file

one image

Visualization of hicAverageRegions.

hicPlotSVL

analysis

one / multiple Hi-C matrices

one image, p-values file, raw data file

Computes short/long range contacts; a box plot, a p-value and raw data file

hicMergeTADbins

preprocessing

one Hi-C matrix, one BED file

one Hi-C matrix

Uses a BED file of domains or TAD boundaries to merge the bin counts of a Hi-C matrix.

chicQualityControl

preprocessing

Hi-C matrices reference point BED file

two plots accepted reference point BED file

Checks for sparsity of viewpoints and removes them if too sparse.

chicViewpointBackgroundModel

preprocessing

Hi-C matrices reference point BED file

background model file

Creates a background model for all given samples and reference points.

chicViewpoint

preprocessing

Hi-C matrices background model file

viewpoint file(s)

Creates per sample per viewpoint one viewpoint file.

chicSignificantInteractions

chicAggregateStatistic

preprocessing analysis

viewpoint file(s) background model file

significant interaction file(s) target file(s)

Detects significant interactions per viewpoint based on the background and neighborhood merging via x-fold and loose p-values.

preprocessing

viewpoint files(s) target file (s)

aggregated file(s) for differential test

Aggregates for one viewpoint of two samples via a target file the locations to test for differential interactions.

chicDifferentialTest

analysis

aggregated file(s) of two samples

H0_accepted-, H0_rejected-files

Tests with chi2 or fisher for differential interactions of two samples.

chicPlotViewpoint

visualization

viewpoint file(s) differential expression file(s) significant interactions file(s)

one image per viewpoint

Visualization of a viewpoint.

General principles

A typical HiCExplorer command could look like this:

$ hicPlotMatrix -m myHiCmatrix.h5 \
-o myHiCmatrix.pdf \
--clearMaskedBins \
--region chrX:10,000,000-15,000,000 \
--vMin -4 --vMax 4 \

You can always see all available command-line options via –help:

$ hicPlotMatrix --help
  • Output format of plots should be indicated by the file ending, e.g. MyPlot.pdf will return a pdf file, MyPlot.png a png-file.

  • Most of the tools that produce plots can also output the underlying data - this can be useful in cases where you don’t like the HiCExplorer visualization, as you can then use the data matrices produced by deepTools with your favorite plotting tool, such as R.

  • The vast majority of command line options are also available in Galaxy (in a few cases with minor changes to their naming).

Example usage

Hi-C analysis of mouse ESCs using HiCExplorer

The following example shows how we can use HiCExplorer to analyze a published dataset. Here we are using a Hi-C dataset from Marks et. al. 2015, on mouse ESCs.

Protocol

The collection of the cells for Hi-C and the Hi-C sample preparation procedure was performed as previously described Lieberman-Aiden et al., with the slight modification that DpnII was used as restriction enzyme during initial digestion. Paired-end libraries were prepared according to Lieberman-Aiden et al. and sequenced on the NextSeq 500 platform using 2 × 75 bp sequencing.

Prepare for analysis
Download Raw fastq files

The fastq files can be downloaded from the EBI archive (or NCBI archive). We will store the files in the directory original_data.

mkdir original_data

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/007/SRR1956527/SRR1956527_1.fastq.gz -O original_data/SRR1956527_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/007/SRR1956527/SRR1956527_2.fastq.gz -O original_data/SRR1956527_2.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/008/SRR1956528/SRR1956528_1.fastq.gz -O original_data/SRR1956528_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/008/SRR1956528/SRR1956528_2.fastq.gz -O original_data/SRR1956528_2.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/009/SRR1956529/SRR1956529_1.fastq.gz -O original_data/SRR1956529_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR195/009/SRR1956529/SRR1956529_2.fastq.gz -O original_data/SRR1956529_2.fastq.gz
Create an index

We start with creating an index for our alignment software for the GRCm38/mm10 genome. As a source we use the mm10 genome from UCSC

mkdir genome_mm10
wget http://hgdownload-test.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz -O genome_mm10/chromFa.tar.gz
tar -xvzf genome_mm10/chromFa.tar.gz
cat genome_mm10/*.fa > genome_mm10/mm10.fa

We have the mm10 genome stored in one fasta file and can build the index. We tried it successfully with hisat2, bowtie2 and bwa. Run the mapping with one of them and do not mix them!

hisat2
hisat2-build -p 8 genome_mm10/mm10.fa hisat2/mm10_index

You can find more information about hisat

bowtie2
bowtie2-build genome_mm10/mm10.fa bowtie2/mm10_index --threads 8

You can find more information about bowtie

bwa
bwa index -p bwa/mm10_index genome_mm10/mm10.fa

You can find more information about bwa

Mapping the RAW files

Mates have to be mapped individually to avoid mapper specific heuristics designed for standard paired-end libraries.

It is important to have in mind for the different mappers:

  • for either bowtie2 or hisat2 use the –reorder parameter which tells bowtie2 or hisat2 to output the sam files in the exact same order as in the .fastq files.

  • use local mapping, in contrast to end-to-end. A fraction of Hi-C reads are chimeric and will not map end-to-end thus, local mapping is important to increase the number of mapped reads.

  • Tune the aligner parameters to penalize deletions and insertions. This is important to avoid aligned reads with gaps if they happen to be chimeric.

hisat2
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956527_1.fastq.gz --reorder | samtools view -Shb - > SRR1956527_1.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956527_2.fastq.gz --reorder | samtools view -Shb - > SRR1956527_2.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956528_1.fastq.gz --reorder | samtools view -Shb - > SRR1956528_1.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956528_2.fastq.gz --reorder | samtools view -Shb - > SRR1956528_2.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956529_1.fastq.gz --reorder | samtools view -Shb - > SRR1956529_1.bam
hisat2 -x hisat2/mm10_index --threads 8 -U ../original_data/SRR1956529_2.fastq.gz --reorder | samtools view -Shb - > SRR1956529_2.bam
bowtie2
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956527_1.fastq.gz --reorder | samtools view -Shb - > SRR1956527_1.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956527_2.fastq.gz --reorder | samtools view -Shb - > SRR1956527_2.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956528_1.fastq.gz --reorder | samtools view -Shb - > SRR1956528_1.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956528_2.fastq.gz --reorder | samtools view -Shb - > SRR1956528_2.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956529_1.fastq.gz --reorder | samtools view -Shb - > SRR1956529_1.bam
bowtie2 -x bowtie2/mm10_index --threads 8 -U ../original_data/SRR1956529_2.fastq.gz --reorder | samtools view -Shb - > SRR1956529_2.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956527_1.fastq.gz | samtools view -Shb - > SRR1956527_1.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956527_2.fastq.gz | samtools view -Shb - > SRR1956527_2.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956528_1.fastq.gz | samtools view -Shb - > SRR1956528_1.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956528_2.fastq.gz | samtools view -Shb - > SRR1956528_2.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956529_1.fastq.gz | samtools view -Shb - > SRR1956529_1.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 bwa/mm10_index original_data/SRR1956529_2.fastq.gz | samtools view -Shb - > SRR1956529_2.bam
Build, visualize and correct Hi-C matrix
Create a Hi-C matrix using the aligned files

In the following we will create three Hi-C matrices and merge them to one.

Build Hi-C matrix

hicBuildMatrix builds the matrix of read counts over the bins in the genome, considering the sites around the given restriction site. We need to provide:

  • the input BAM/SAM files: –samFiles SRR1956527_1.sam SRR1956527_2.sam

  • binsize: –binSize 1000

  • restriction sequence: –restrictionSequence GATC

  • dangling end sequence: –danglingSequence GATC

  • the restriction sequence cut site file: create it with hicFindRestSite, the restriction sequence ‘GATC’ and the above used mouse genome fasta file

  • the name of output bam file which contains the accepted alignments: –outBam SRR1956527_ref.bam

  • name of output matrix file: –outFileName hicMatrix/SRR1956527_10kb.h5

  • the folder for the quality report: –QCfolder hicMatrix/SRR1956527_QC

  • the number of to be used threads. Minimum value is 3: –threads 8

  • the buffer size for each thread buffering inputBufferSize lines of each input BAM/SAM file: –inputBufferSize 400000

To build the Hi-C matrices:

mkdir hicMatrix
hicBuildMatrix --samFiles SRR1956527_1.bam SRR1956527_2.bam --binSize 10000 --restrictionSequence GATC --danglingSequence GATC --restrictionCutFile cut_sites.bed --outBam SRR1956527_ref.bam --outFileName hicMatrix/SRR1956527_10kb.h5 --QCfolder hicMatrix/SRR1956527_10kb_QC --threads 8 --inputBufferSize 400000
hicBuildMatrix --samFiles SRR1956528_1.bam SRR1956528_2.bam --binSize 10000 --restrictionSequence GATC --danglingSequence GATC --restrictionCutFile cut_sites.bed --outBam SRR1956528_ref.bam --outFileName hicMatrix/SRR1956528_10kb.h5 --QCfolder hicMatrix/SRR1956528_10kb_QC --threads 8 --inputBufferSize 400000
hicBuildMatrix --samFiles SRR1956529_1.bam SRR1956529_2.bam --binSize 10000 --restrictionSequence GATC --danglingSequence GATC --restrictionCutFile cut_sites.bed --outBam SRR1956529_ref.bam --outFileName hicMatrix/SRR1956529_10kb.h5 --QCfolder hicMatrix/SRR1956529_10kb_QC --threads 8 --inputBufferSize 400000

The output bam files show that we have around 34M, 54M and 58M selected reads for SRR1956527, SRR1956528 & SRR1956529, respectively. Normally 25% of the total reads are selected. The output matrices have counts for the genomic regions. The extension of output matrix files is .h5.

A quality report is created in e.g. hicMatrix/SRR1956527_10kb_QC, have a look at the report hicQC.html.

The Hi-C quality report showing the results for 'pairs used & filtered'

A segment of Hi-C quality report.

Merge (sum) matrices from replicates

To increase the depth of reads we merge the counts from these three replicates.

hicSumMatrices --matrices hicMatrix/SRR1956527_10kb.h5 hicMatrix/SRR1956528_10kb.h5 \
        hicMatrix/SRR1956529_10kb.h5 --outFileName hicMatrix/replicateMerged_10kb.h5
Plot Hi-C matrix

A 10kb bin matrix is quite large to plot and is better to reduce the resolution (to know the size of a Hi-C matrix use the tool hicInfo), i.e. we usually run out of memory for a 1 kb or a 10 kb matrix and second, the time to plot is very long (minutes instead of seconds). For this we use the tool hicMergeMatrixBins.

Merge matrix bins for plotting

hicMergeMatrixBins merges the bins into larger bins of given number (specified by –numBins). We will merge 1000 bins in the original (uncorrected) matrix and then correct it. The new bin size is going to be 10.000 bp * 100 = 1.000.000 bp = 1 Mb

hicMergeMatrixBins \
--matrix hicMatrix/replicateMerged_10kb.h5 --numBins 100 \
--outFileName hicMatrix/replicateMerged.100bins.h5
Plot the corrected Hi-C matrix

hicPlotMatrix can plot the merged matrix. We use the following options:

  • the matrix to plot: –matrix hicMatrix/replicateMerged.100bins.h5

  • logarithmic values for plotting: –log1p

  • the resolution of the plot: –dpi 300

  • masked bins should not be plotted: –clearMaskedBins

  • the order of the chromosomes in the plot: –chromosomeOrder chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY

  • the color map: –colorMap jet

  • the title of the plot: –title “Hi-C matrix for mESC”

  • the plot image itself: –outFileName plots/plot_1Mb_matrix.png

mkdir plots
hicPlotMatrix \
--matrix hicMatrix/replicateMerged.100bins.h5 \
--log1p \
--dpi 300 \
--clearMaskedBins \
--chromosomeOrder chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY \
--colorMap jet \
--title "Hi-C matrix for mESC" \
--outFileName plots/plot_1Mb_matrix.png
corrected\_1Mb\_plot

The Hi-C interaction matrix with a resolution of 1 MB.

Correct Hi-C Matrix

hicCorrectMatrix corrects the matrix counts in an iterative manner. For correcting the matrix, it’s important to remove the unassembled scaffolds (e.g. NT_) and keep only chromosomes, as scaffolds create problems with matrix correction. Therefore we use the chromosome names (1-19, X, Y) here. Important: Use ‘chr1 chr2 chr3 etc.’ if your genome index uses chromosome names with the ‘chr’ prefix.

Matrix correction works in two steps: first a histogram containing the sum of contact per bin (row sum) is produced. This plot needs to be inspected to decide the best threshold for removing bins with lower number of reads. The second steps removes the low scoring bins and does the correction.

In the following we will use a matrix with a bin size of 20 kb: 10kb * 2 = 20 kb

hicMergeMatrixBins \
--matrix hicMatrix/replicateMerged_10kb.h5 --numBins 2 \
--outFileName hicMatrix/replicateMerged.matrix_20kb.h5

(1-19, X, Y) variant:

hicCorrectMatrix diagnostic_plot \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 --plotName hicMatrix/diagnostic_plot.png

(chr1-ch19, chrX, chrY) variant:

hicCorrectMatrix diagnostic_plot \
--chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 --plotName hicMatrix/diagnostic_plot.png
diagplot

Diagnostic plot for the Hi-C matrix at a resolution of 20 kb

The output of the program prints a threshold suggestion that is usually accurate but is better to revise the histogram plot. The threshold is visualized in the plot as a black vertical line. See Example usage for an example and for more info.

The threshold parameter needs two values:
  • low z-score

  • high z-score

“The absolute value of z represents the distance between the raw score and the population mean in units of the standard deviation. z is negative when the raw score is below the mean, positive when above.” (Source). For more information see wikipedia.

z-score definition: z = (x - my) / sigma

The z-score definition.

In our case the distribution describes the counts per bin of a genomic distance. To remove all bins with a z-score threshold less / more than X means to remove all bins which have less / more counts than X of mean of their specific distribution in units of the standard deviation.

Looking at the above distribution, we can select the value of -2 (lower end) and 3 (upper end) to remove. This is given by the –filterThreshold option in hicCorrectMatrix.

(1-19, X, Y) variant:

hicCorrectMatrix correct \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 \
--filterThreshold -2 3 --perchr --outFileName hicMatrix/replicateMerged.Corrected_20kb.h5

(chr1-ch19, chrX, chrY) variant:

hicCorrectMatrix correct \
--chromosomes chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY \
--matrix hicMatrix/replicateMerged.matrix_20kb.h5 \
--filterThreshold -2 3 --perchr --outFileName hicMatrix/replicateMerged.Corrected_20kb.h5

It can happen that the correction stops with:

`ERROR:iterative correction:*Error* matrix correction produced extremely large values.
This is often caused by bins of low counts. Use a more stringent filtering of bins.`

This can be solved by a more stringent z-score values for the filter threshold or by a look at the plotted matrix. In our case we see that chromosome Y is having more or less 0 counts in its bins. This chromosome can be excluded from the correction by not defining it for the set of chromosomes that should be corrected, parameter –chromosomes.

In the case of multiple samples / replicates that need to be normalized to the same read coverage we recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

Plot corrected matrix

We can now plot the one of the chromosomes (e.g. chromosome X) , with the corrected matrix.

New parameter:
  • The region to plot: –region chrX:10000000-2000000 or –region chrX

(1-19, X, Y) variant:

hicPlotMatrix \
--log1p --dpi 300 \
-matrix hicMatrix/replicateMerged.Corrected_20kb.npz \
--region X --title "Corrected Hi-C matrix for mESC : chrX" \
--outFileName plots/replicateMerged_Corrected-20kb_plot-chrX.png

(chr1-ch19, chrX, chrY) variant:

hicPlotMatrix \
--log1p --dpi 300 \
--matrix hicMatrix/replicateMerged.Corrected_20kb.npz \
--region chrX --title "Corrected Hi-C matrix for mESC : chrX" \
--outFileName plots/replicateMerged_Corrected-20kb_plot-chrX.png
correctMatrixPlot

The Hi-C interaction matrix for chromosome X.

Plot TADs

“The partitioning of chromosomes into topologically associating domains (TADs) is an emerging concept that is reshaping our understanding of gene regulation in the context of physical organization of the genome” [Ramirez et al. 2017].

Find TADs

TAD calling works in two steps: First HiCExplorer computes a TAD-separation score based on a z-score matrix for all bins. Then those bins having a local minimum of the TAD-separation score are evaluated with respect to the surrounding bins to decide assign a p-value. Then a cutoff is applied to select the bins more likely to be TAD boundaries.

hicFindTADs tries to identify sensible parameters but those can be change to identify more stringent set of boundaries.

mkdir TADs
hicFindTADs --matrix hicMatrix/replicateMerged.Corrected_20kb.h5 \
--minDepth 60000 --maxDepth 120000 --numberOfProcessors 8 --step 20000 \
--outPrefix TADs/marks_et-al_TADs_20kb-Bins  --minBoundaryDistance 80000 \
--correctForMultipleTesting fdr --threshold 0.05

As an output we get the boundaries, domains and scores separated files. We will use in the plot below only the TAD-score file.

Build Tracks File

We can plot the TADs for a given chromosomal region. For this we need to create a track file containing the instructions to build the plot. The hicPlotTADs documentation contains the instructions to build the track file.

In following plot we will use the listed track file. Please store it as track.ini.

[hic]
file = hicMatrix/replicateMerged.Corrected_20kb.h5
title = HiC mESC chrX:99974316-101359967
colormap = RdYlBu_r
depth = 2000000
height = 7
transform = log1p
file_type = hic_matrix

[tads]
file = TADs/marks_et-al_TADs_20kb-Bins_domains.bed
file_type = domains
border_color = black
color = none
line_width = 1.5
overlay_previous = share-y
show_data_range = no

[x-axis]
fontsize = 16
where = top

[tad score]
file = TADs/marks_et-al_TADs_20kb-Bins_score.bm
title = TAD separation score
height = 4
file_type = bedgraph_matrix

[spacer]

[gene track]
file = mm10_genes_sorted.bed
height = 10
title = mm10 genes
labels = false

We used as a gene track mm10 genes and sorted with sortBed from bedtools.

Plot

We plot the result with:

(1-19, X, Y) variant:

hicPlotTADs --tracks track.ini --region X:98000000-105000000 \
--dpi 300 --outFileName plots/marks_et-al_TADs.png \
--title "Marks et. al. TADs on X"

(chr1-ch19, chrX, chrY) variant:

hicPlotTADs --tracks track.ini --region chrX:98000000-105000000 \
--dpi 300 --outFileName plots/marks_et-al_TADs.png \
--title "Marks et. al. TADs on X"

The result is:

TADplot

TADplot

Importing and Exporting HiCExplorer data

Exporting HiCExplorer output to Bioconductor

It’s possible to export Hi-C Matrices produced by HiCExplorer to bioconductor in R, which allows us to use existing bioconductor infrastructure for differential Hi-C analysis. The tool hicConvertFormat allows us to write Hi-C matrices in a format that can easily be imported in bioconductor as GInteractions object. Below is an example.

## Assuming HiCExplorer is installed in ~/programs
hicConvertFormat --matrix ~/programs/HiCExplorer/test/test_data/Li_et_al_2015.h5 \ --inputFormat h5
-o GInteration_example --outputFormat GInteractions

The output file is in tsv format. It looks like this :

V1

V2

V3

V4

V5

V6

V7

X

19537

20701

X

19537

20701

1054.47483

X

19537

20701

X

20701

22321

375.86990

X

19537

20701

X

22321

24083

222.53900

X

19537

20701

X

24083

25983

114.26340

X

19537

20701

X

25983

27619

95.87463

This file can now be loaded into R as a GInteractions object, as shown below :

## INSIDE R
library(GenomicRanges)
library(InteractionSet)

hic <- read.delim("GInteraction_example.tsv", header = FALSE)

# Converting data.frame to GInteraction
convertToGI <- function(df){
            row.regions <- GRanges(df$V1, IRanges(df$V2,df$V3))# interaction start
            col.regions <- GRanges(df$V4, IRanges(df$V5,df$V6))# interaction end
            gi <- GInteractions(row.regions, col.regions)
            gi$norm.freq <- df$V7 # Interaction frequencies
            return(gi)
}
                        }
hic.gi <- convertToGI(hic)

Multiple files can be loaded, and converted to an InteractionSet object. If you have prepared matrices using binning, the intervals in the matrices must be the same. Therefore it’s easy to merge these matrices together in an InteractionSet object. In case some bins don’t match, we can merge the GInteraction objects based on matching bins, as follows.

# assuming hic.gi is a list of two GInteration objects hic.gi1 and hic.gi2
# hic.gi <- list(hic.gi1, hic.gi2)

# Get common regions between the two objects
combined <- unique(c(hic.gi$hic.gi1, hic.gi$hic.gi2))

# replace original regions with the common regions
replaceRegions(hic.gi$hic.gi1) <- regions(combined)
replaceRegions(hic.gi$hic.gi2) <- regions(combined)

# Get the matching indexes between the two objects
matched <- lapply(hic.gi, function(x) {
            match(x, combined)
            })

# Create a count matrix (for interaction frequencies)
counts <- matrix(0, ncol = 2, nrow=length(combined)) # counts for unmatched bins set to zero

# fill in the counts for matched bins
counts[matched$hic.gi1,1] <- hic.gi$hic.gi1$norm.freq
counts[matched$hic.gi2,2] <- hic.gi$hic.gi2$norm.freq

# Finally, create the InteractionSet object
iset <- InteractionSet(counts, combined)

InteractionSet objects can be used for packages like diffHic, for differential Hi-C analysis.

  • For more information on working with GInteraction and InteractionSet objects in bioconductor check out this vignette.

Captured Hi-C data analysis

How we use HiCExplorer to analyse cHi-C data

This How-to is based on the published dataset from Andrey et al. 2017. For the tutorial, we use the samples FL-E13.5 and MB-E-10.5.

Disclaimer: With HiCExplorer 3.7 all data is stored in hdf5 based files to make the handling easier. Please check your version of HiCExplorer to make sure you use the latest version. Commands of older HiCExplorer version do not work with the introduced changes!

Download the raw data

Please download the raw data via the following links or via NCBI GSE84795 .

Dataset

forward

reverse

CC-FL-E135-Wt-Mm-Rep1

SRR3950565_1

SRR3950565_2

CC-FL-E135-Wt-Mm-Rep2

SRR3950566_1

SRR3950566_2

CC-MB-E105-Wt-Mm-Rep1

SRR3950559_1

SRR3950559_2

CC-MB-E105-Wt-Mm-Rep2

SRR3950560_1

SRR3950560_2

Mapping

Map the files with a mapper of your choice against the mm9 reference genome; as an example, the mapping with bowtie2 is shown.

bowtie2 -x mm9_index --threads 8 -U SRR3950565_1.fastq.gz --reorder | samtools view -Shb - > SRR3950565_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950565_2.fastq.gz --reorder | samtools view -Shb - > SRR3950565_2.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950566_1.fastq.gz --reorder | samtools view -Shb - > SRR3950566_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950566_2.fastq.gz --reorder | samtools view -Shb - > SRR3950566_2.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950559_1.fastq.gz --reorder | samtools view -Shb - > SRR3950559_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950559_2.fastq.gz --reorder | samtools view -Shb - > SRR3950559_2.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950560_1.fastq.gz --reorder | samtools view -Shb - > SRR3950560_1.bam
bowtie2 -x mm9_index --threads 8 -U SRR3950560_2.fastq.gz --reorder | samtools view -Shb - > SRR3950560_2.bam
Create cHi-C matrices

To create a cHi-C matrix we use HiCExplorer’s hicBuildMatrix for each replicate separately and merge the replicates into a single matrix later. Like Andrey et al., we use a resolution of 1kb and use the restriction enzyme DpnII.

hicBuildMatrix --samFiles SRR3950565_1.bam SRR3950565_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950565.cool --QCfolder SRR3950565_QC --threads 6
hicBuildMatrix --samFiles SRR3950566_1.bam SRR3950566_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950566.cool --QCfolder SRR3950566_QC --threads 6
hicBuildMatrix --samFiles SRR3950559_1.bam SRR3950559_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950559.cool --QCfolder SRR3950559_QC --threads 6
hicBuildMatrix --samFiles SRR3950560_1.bam SRR3950560_2.bam  --binSize 1000 --restrictionSequence GATC --outFileName SRR3950560.cool --QCfolder SRR3950560_QC --threads 6
hicSumMatrix --matrices SRR3950565.cool SRR3950566.cool --outFileName FL-E13-5.cool
hicSumMatrix --matrices SRR3950559.cool SRR3950560.cool --outFileName MB-E10-5.cool
Terminology: Reference point vs viewpoint

A reference point is one single genomic position i.e. chr1 500 510 is a reference point. A viewpoint is in contrast the region defined by the reference point and the up and downstream range, i.e. range 100 and reference point chr1 50 70 leads to the viewpoint chr1 400 610.

Creation of reference point file

Andrey et al. state that they used a total of 460 reference points, but that 24 were removed due to low sequence coverage or non-correspondence to a promoter region, leading to 446 in total.

To reproduce this, we need all reference points which are published in Supplementary Table S2 and S8.

It is simplest to create the reference point file in the following format using Excel and store it as a tab separated file:

chr1        4487435 4487435 Sox17

Otherwise, just download the prepared file. We will do the quality control on our own and compare with the results of Andrey et al.

Quality control

As a first step we compute the quality of each viewpoint by considering the sparsity. As soon as one viewpoint in one sample is less than the user-defined threshold (–sparsity), the reference point is no longer considered.

chicQualityControl -m FL-E13-5.cool MB-E10-5.cool -rp reference_points.bed --sparsity 0.025 --threads 20

The quality control creates five files: two plots showing the sparsity structure of the samples and three files containing the accepted reference points, the rejected ones and one file with all viewpoints and their sparsity level per sample.

In our example the plots look like the following:

_images/sparsity.png _images/histogram.png

The first plot shows the sparsity per sample for each viewpoint, while the second one shows the sparsity distribution as a histogram. It can be seen quite clearly that only a minority of the samples are really sparse and therefore need to be removed. The red line indicates the chosen sparsity level.

The reference point Tdap2b at chr1 19198995, which has a sparsity of 0.018 in FL-E13-5 and 0.016 in MB-E10-5, is considered to be of bad quality. To confirm this result we plot the viewpoint:

_images/Tfap2b_FL-E13-5_MB-E10-5_chr1_19198995_19198995.png

The plot shows there are effectively no interactions except with the reference point itself and confirm the point should be removed from the data.

The result of the quality control rejected 71 reference points as too sparse, but surprisingly the viewpoints rejected by Andrey et al. are accepted. An explanation for this could be that we only consider two samples and not all samples used by Andrey, and therefore we missed the bad quality of some viewpoints.

Please consider that this bad viewpoint was selected arbitrary out of the sample data and is only an example.

Download the data: Filtered reference points, Quality control raw data and rejected reference points.

Background model

The background model computes all viewpoints given by the reference points for both samples in a range defined by the parameter fixateRange. We recommend setting it to 500kb because real interactions above the range are rarely observed and very low interaction numbers such as 1 are already considered to be significant. With this setting, only the interactions in a range 500kb up- and downstream of the reference point are considered for each viewpoint. Based on this data, two background models are computed; the first one computes the average per relative distance to the reference point, and secondly, a negative binomial distribution per relative distance to the reference point is fitted. This first model is used for filtering in the significant interaction evaluation by an x-fold factor and for plotting. The negative binomial model is more important; it is used to compute a p-value per relative distance in each sample, which is used to make the decision if an interaction is considered as significant.

chicViewpointBackgroundModel -m FL-E13-5.cool MB-E10-5.cool --fixateRange 500000 -t 20 -rp reference_points.bed -o background_model.txt

The background model looks like this:

Relative position   size nbinom     prob nbinom     max value       mean value
-500000             75.895607451213 0.998528939430  2.333333333333  0.000101543771
-499000             90.348171762247 0.998725799952  2.750000000000  0.000104681360
-498000             78.512621775755 0.998514111424  2.800000000000  0.000106107536
-497000             75.706478185610 0.998327784087  3.800000000000  0.000116147819

You can download the background model.

Viewpoint computation

In this step the viewpoints for each reference point listed in a reference_points.bed-file is extracted from the interaction matrix, using the quality controlled file created by chicQualityControl. The up- and downstream range can be given via –range upstream downstream. Please use the same value for –fixateRange as in the background model computation. For each relative distance the x-fold over the average value of this relative distance is computed and each location is assigned a p-value based on the background negative binomial distribution for this relative distance. The output is one hdf5 based file containing all the data.

chicViewpoint -m FL-E13-5.cool MB-E10-5.cool --averageContactBin 5 --range 1000000 1000000 -rp referencePoints.bed -bmf background_model.txt --outFileName interactions.hdf5 --fixateRange 500000 --threads 20
chicExportData --file interactions.hdf5 --outputMode all -o interactions.tar.gz  -t 16

The text files for each matrix and viewpoint have the following structure:

# Chromosome        Start   End     Gene    Sum of interactions     Relative position       Relative Interactions   p-value x-fold  Raw
chr1        14167000        14168000        Eya1    673.000000000000        -133000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14168000        14169000        Eya1    673.000000000000        -132000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14169000        14170000        Eya1    673.000000000000        -131000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14170000        14171000        Eya1    673.000000000000        -130000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14171000        14172000        Eya1    673.000000000000        -129000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14172000        14173000        Eya1    673.000000000000        -128000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14173000        14174000        Eya1    673.000000000000        -127000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14174000        14175000        Eya1    673.000000000000        -126000 0.000000000000  1.000000000000  0.000000000000  0.000000000000
chr1        14175000        14176000        Eya1    673.000000000000        -125000 0.000297176820  0.042850447268  0.916568982282  0.200000000000
chr1        14176000        14177000        Eya1    673.000000000000        -124000 0.000297176820  0.042850447268  0.908160092485  0.200000000000

Each file contains a header with information about content of the different columns.

Significant interactions detection

To detect significant interactions and to prepare a target file for each viewpoint which will be used for the differential analysis, the script chicSignificantInteractions is used. It offers two modes: either the user can specify an x-fold value or a loose p-value. The first one considers all interactions with a minimum x-fold over the average background for its relative distribution as a candidate or secondly, all interactions with a loose p-value or less are considered. These are pre-selection steps to be able to detect wider peaks in the same way as sharp ones. All detected candidates are merged to one peak if they are direct neighbors, and the sum of all interaction values of this neighborhood is used to compute a new p-value. The p-value is computed based on the relative distance continuous negative binomial distribution of the interaction with the original highest interaction value. All peaks considered are accepted as significant interactions if their p-value is as small as the threshold –pvalue.

To exclude interactions with an interaction value smaller than desired the parameter –peakInteractionsThreshold can be set.

In this example we use the created interactions.hdf5 file of chicViewpoint, a loose p-value of 0.1 and p-value of 0.01. For all stored locations the significant interactions are computed:

chicSignificantInteractions --interactionFile interactions.hdf5 -bmf background_model.txt --range 1000000 1000000 --pValue 0.01 --loosePValue 0.1 --outFileNameSignificant significant.hdf5 --outFileNameTarget target.hdf5 --combinationMode dual

This creates two files, a file storing all significant interactions, significant.hdf5, and a target file, target.hdf5. The content of the two files can be export via chicExport:

chicExportData --file significant.hdf5 --outputMode all -o targets.tar.gz  -t 16
chicExportData --file target.hdf5 --outputMode all -o targets.tar.gz  -t 16

For all tools it is also possible to just extract one gene by the gene name as it was given in the fourth column of the reference file. For each name, the data for all stored matrices is extracted. This mode works also on the interactions, target, aggregated and differential files.

chicExportData --file significant.hdf5 --outputMode geneName --outputModeName Eya1  -t 16

The significant interaction files looks like the following:

# Chromosome        Start   End     Gene    Sum of interactions     Relative position       Relative Interactions   p-value x-fold  Raw
chr1        14274000        14278000        Eya1    673.000000000000        -26000  0.008320950966  0.007037698679  7.811125758170  5.600000000000
chr1        14296000        14298000        Eya1    673.000000000000        -3000   0.057949479941  0.104215728599  3.781355935916  39.000000000000
chr1        14314000        14316000        Eya1    673.000000000000        14000   0.015156017831  0.048507853154  3.190333935010  10.200000000000
chr1        14317000        14319000        Eya1    673.000000000000        17000   0.014561664190  0.034914431882  3.517090338584  9.800000000000
chr1        14480000        14488000        Eya1    673.000000000000        184000  0.011292719168  0.000000010481  19.285082946462 7.600000000000
chr1        14491000        14501000        Eya1    673.000000000000        200000  0.030683506686  0.000000000000  56.967824833429 20.650000000000

The target file looks like:

chr1        14274000        14278000
chr1        14295000        14298000
chr1        14314000        14319000
chr1        14426000        14432000
chr1        14447000        14455000
chr1        14460000        14465000
chr1        14480000        14488000
chr1        14491000        14501000

The parameter –combinationMode has the options single and dual. This parameter is important if a differential analysis should be computed, either a target region is only for one gene of one matrix, or the regions of one gene of two matrices are combined. For example: - dual combines as follows: [[matrix1_gene1, matrix2_gene1], [matrix2_gene1, matrix3_gene1],[matrix1_gene2, matrix2_gene2], …] - single combines as follows: [matrix1_gene1, matrix1_gene2, matrix2_gene1, …]

Aggregate data for differential test

The process to aggregate data is only necessary if the differential test is used. chicAggregateStatistic takes the created interaction file from chicViewpoint as input and either the target file created by chicSignificantInteractions or one target file which applies for all viewpoints.

chicAggregateStatistic --interactionFile interaction.hdf5 --targetFile target.hdf5 --outFileName aggregate.hdf5 -t 16
chicAggregateStatistic --interactionFile interaction.hdf5 --targetFile regions_of_interest.bed --outFileName aggregate.hdf5 -t 16

It selects the original data based on the target locations and returns one hdf5 based file.

Differential test

The differential test tests the interaction value of the reference point and the interaction value of the target of two samples for a differential expression. To achieve this, either Fisher’s test or the chi-squared test can be used. H0 is defined as ‘both locations are equal’, meaning the differential expressed targets can be found in the H0 rejected file.

chicDifferentialTest --aggregatedFile aggregate.hdf5  --alpha 0.05 --statisticTest fisher --outFileName differential.hdf5 -t 16

It is important to set the desired alpha value and the output is written to one hdf5 based file. For each sample three internal datasets are created:

  • H0 rejected targets

  • H0 accepted targets

  • one file containing both

The data can be exported via chicExportData:

chicExportData --file differential.hdf5 --outputMode all  -o differential.tar.gz  -t 16

The created tar.gz file contains for each tested regions the three files rejected, ‘accepted’ and a file containing both information. It is also possible to extract only one gene:

chicExportData --file differential.hdf5 --outputMode geneName --outputModeName Eya1 -o differential_one  -t 16

In case of the single extraction, the output file name serves only to determine the folder to save the data. The names are based on the gene name.

# Chromosome        Start   End     Gene    Relative distance       sum of interactions 1   target_1 raw    sum of interactions 2   target_2 raw    p-value
chr1        14274000        14278000        Eya1    -23000  673.000000000000        5.600000000000  832.000000000000        0.000000000000  0.008134451704
chr1        14295000        14298000        Eya1    -3000   673.000000000000        44.400000000000 832.000000000000        49.800000000000 0.670801771179
chr1        14314000        14319000        Eya1    18000   673.000000000000        24.200000000000 832.000000000000        23.400000000000 0.385952554141
chr1        14426000        14432000        Eya1    131000  673.000000000000        2.400000000000  832.000000000000        9.000000000000  0.245228952263
chr1        14447000        14455000        Eya1    154000  673.000000000000        4.400000000000  832.000000000000        12.400000000000 0.232102031454
chr1        14460000        14465000        Eya1    164000  673.000000000000        3.200000000000  832.000000000000        7.600000000000  0.564250349826
chr1        14480000        14488000        Eya1    187000  673.000000000000        7.600000000000  832.000000000000        3.200000000000  0.151571165731
chr1        14491000        14501000        Eya1    200000  673.000000000000        20.650000000000 832.000000000000        18.750000000000 0.338670136491
Plotting of Viewpoints

chicPlotViewpoint can plot viewpoints of one gene in one plot, add the mean background, show the p-value per relative distance per sample as an additional heatmap bar and highlight significant interactions or differential expressed regions.

One viewpoint:

chicPlotViewpoint --interactionFile interactions.hdf5 --combinationMode oneGene --combinationName Eya1 --range 200000 200000 -o single_plot.tar.gz --outputFormat png
_images/single_plot.png

Two viewpoints, background, differential expression and p-values:

chicPlotViewpoint --interactionFile interactions.hdf5 --combinationMode dual --range 300000 300000 --pValue --differentialTestResult differential.hdf5 --backgroundModelFile background_model.txt -o differential_background_pvalue.tar.gz

This command plots two viewpoints of two different matrices of the same gene in a plot and highlights the differential regions.

_images/differential_background_pvalue.png

Two viewpoints, background, significant interactions and p-values:

chicPlotViewpoint --interactionFile interaction.hdf5 --combinationMode dual --range 300000 300000 --pValue --significantInteractions significant.hdf5 --plotSignificantInteractions --backgroundModelFile background_model.txt -o significant_background_pvalue.tar.gz
_images/significant_background_pvalue.png

How we use HiCExplorer

To generate a Hi-C contact matrix is necessary to perform the following basic steps

  1. Map the Hi-C reads to the reference genome

  2. Filter the aligned reads to create a contact matrix

  3. Filter matrix bins with low or zero read coverage

  4. Remove biases from the Hi-C contact matrices

After a corrected Hi-C matrix is created other tools can be used to visualize it, call TADS or compare it with other matrices.

Reads mapping

Mates have to be mapped individually to avoid mapper specific heuristics designed for standard paired-end libraries.

We have used the HiCExplorer successfully with bwa, bowtie2 and hisat2. However, it is important to:

  • for either bowtie2`or `hisat2 use the –reorder parameter which tells bowtie2 or hisat2 to output the sam files in the exact same order as in the .fastq files.

  • use local mapping, in contrast to end-to-end. A fraction of Hi-C reads are chimeric and will not map end-to-end thus, local mapping is important to increase the number of mapped reads.

  • Tune the aligner parameters to penalize deletions and insertions. This is important to avoid aligned reads with gaps if they happen to be chimeric.

# map the reads, each mate individually using
# for example bwa
#
# bwa mem mapping options:
#       -A INT        score for a sequence match, which scales options -TdBOELU unless overridden [1]
#       -B INT        penalty for a mismatch [4]
#       -O INT[,INT]  gap open penalties for deletions and insertions [6,6]
#       -E INT[,INT]  gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1] # this is set very high to avoid gaps
#                                  at restriction sites. Setting the gap extension penalty high, produces better results as
#                                  the sequences left and right of a restriction site are mapped independently.
#       -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5] # this is set to no penalty.

$ bwa mem -A1 -B4  -E50 -L0  index_path \
    mate_R1.fastq.gz 2>>mate_R1.log | samtools view -Shb - > mate_R1.bam

$ bwa mem -A1 -B4  -E50 -L0  index_path \
    mate_R2.fastq.gz 2>>mate_R2.log | samtools view -Shb - > mate_R2.bam
Creation of a Hi-C matrix

Once the reads have been mapped the Hi-C matrix can be built. For this, the minimal extra information required is the binSize used for the matrix. Is it best to enter a low number like 10.000 because lower resolution matrices (larger bins) can be easily constructed using hicMergeMatrixBins. Matrices at restriction fragment resolution can be created by providing a file containing the restriction sites, this file can be created with the tool hicFindRestSites

hicFindRestSites that is part of HiCExplorer.

# build matrix from independently mated read pairs
# the restriction sequence GATC is recognized by the DpnII restriction enzyme

$ hicBuildMatrix --samFiles mate_R1.bam mate_R2.bam \
                 --binSize 10000 \
                 --restrictionSequence GATC \
                 --danglingSequence GATC \
                 --restrictionCutFile cut_sites.bed \
                 --threads 4 \
                 --inputBufferSize 100000 \
                 --outBam hic.bam \
                 -o hic_matrix.h5 \
                 --QCfolder ./hicQC

hicBuildMatrix creates two files, a bam file containing only the valid Hi-C read pairs and a matrix containing the Hi-C contacts at the given resolution. The bam file is useful to check the quality of the Hi-C library on the genome browser. A good Hi-C library should contain piles of reads near the restriction fragment sites. In the QCfolder a html file is saved with plots containing useful information for the quality control of the Hi-C sample like the number of valid pairs, duplicated pairs, self-ligations etc. Usually, only 25%-40% of the reads are valid and used to build the Hi-C matrix mostly because of the reads that are on repetitive regions that need to be discarded.

An important quality control measurement to check is the inter chromosomal fraction of reads as this is an indirect measure of random Hi-C contacts. Good Hi-C libraries have lower than 10% inter chromosomal contacts. The hicQC module can be used to compare the QC measures from different samples.

Correction of Hi-C matrix

The Hi-C matrix has to be corrected to remove GC, open chromatin biases and, most importantly, to normalize the number of restriction sites per bin. Because a fraction of bins from repetitive regions contain few contacts it is necessary to filter those regions first. Also, in mammalian genomes some regions enriched by reads should be discarded. To aid in the filtering of regions hicCorrectMatrix generates a diagnostic plot as follows:

$ hicCorrectMatrix diagnostic_plot -m hic_matrix.h5 -o hic_corrected.png

The plot should look like this:

_images/diagnostic_plot.png

Histogram of the number of counts per bin.

For the upper threshold is only important to remove very high outliers and thus a value of 5 could be used. For the lower threshold it is recommended to use a value between -2 and -1. What it not desired is to try to correct low count bins which could result simply in an amplification of noise. For the upper threshold is not so concerning because those bins will be scaled down.

Once the thresholds have been decided, the matrix can be corrected

# correct Hi-C matrix
$ hicCorrectMatrix correct -m hic_matrix.h5 --filterThreshold -1.5 5 -o hic_corrected.h5

In the case of multiple samples / replicates that need to be normalized to the same read coverage we recommend to compute first the normalization (with hicNormalize) and correct the data (with hicCorrectMatrix) in a second step.

Visualization of results

There are two ways to see the resulting matrix, one using hicPlotMatrix and the other is using hicPlotTADs. The first one allows the visualization over large regions while the second one is preferred to see specific parts together with other information, for example genes or bigwig tracks.

Because of the large differences in counts found int he matrix, it is better to plot the counts using the –log1p option.

$ hicPlotMatrix -m hic_corrected.h5 -o hic_plot.png --region 1:20000000-80000000 --log1p
_images/corrected_matrix_example.png

Corrected Hi-C counts in log scale.

Quality control of Hi-C data and biological replicates comparison

HiCExplorer integrates multiple tools that allow the evaluation of the quality of Hi-C libraries and matrices.

  • hicQC on the log files produced by hicBuildMatrix and control of the pdf file produced.

Proportion of useful reads is important to assess the efficiency of the HiC protocol, which is dependant of proportion of dangling ends detected… Proportion of inter chromosomal, short range and long range contacts are important for….

  • hicPlotDistVsCounts to compare the distribution of corrected Hi-C counts in relation with the genomic

distance between multiple samples. If some differences are observed between biological replicates, these can be investigated more precisely by computing log2ratio matrices.

  • hicCompareMatrices log2ratio of matrices of biological replicates to identify where the potential changes are located.

  • hicPlotPCA bins correlation of two biological replicates.

TAD calling

To call TADs a corrected matrix is needed. Restriction fragment resolution matrices provide the best results. TAD calling works in two steps: First HiCExplorer computes a TAD-separation score based on a z-score matrix for all bins. Then those bins having a local minimum of the TAD-separation score are evaluated with respect to the surrounding bins to decide assign a p-value. Then a cutoff is applied to select the bins more likely to be TAD boundaries.

$ hicFindTADs -m hic_corrected.h5 --outPrefix hic_corrected --numberOfProcessors 16

This code will produce several files: 1. The TAD-separation score file, 2. the z-score matrix, 3. a bed file with the boundary location, 4. a bed file with the domains, 5. a bedgraph file with the TAD-score that can be visualized in a genome browser.

The TAD-separation score and the matrix can be visualized using hicPlotTADs.

_images/chorogenome_example.jpg

Example output from hicPlotTADs from http://chorogenome.ie-freiburg.mpg.de/

A / B compartment analysis

To compute the A / B compartments the matrix needs to be transformed to an observed/expected matrix in the way Lieberman-Aiden describes it. In a next step a pearson correlation matrix and based on it a covariance matrix is computed. Finally the eigenvectors based on the covariance matrix are computed. All these steps are computed with the command:

$ hicPCA -m hic_corrected.h5 --outFileName pca1.bw pca2.bw --format bigwig

If the intermediate matrices of this process should be used for plotting run:

$ hicTransform -m hic_corrected.h5 --outFileName all.h5 --method all

This creates all intermediate matrices: obs_exp_all.h5, pearson_all.h5 and covariance_all.h5.

The A / B compartments can be plotted with hicPlotMatrix.

$ hicPlotMatrix -m pearson_all.h5 --outFileName pca1.png --perChr --bigwig pca1.bw
_images/eigenvector1_lieberman.png

News and Developments

Release 3.7

27 July 2021

  • A new TAD prediction method: hicTADClassifier and a method to train own classifiers hicTrainTADClassifier. Thanks @AlbertLidel for the design and implementation.

  • New file formats for capture Hi-C modules. chicViewpoint, chicSignificantInteractions, chicAggregateStatistic and chicDifferentialTest use now HDF5 based files. A new export script chicExportData is provided to retrieve text files from these HDF5 files.

  • Implementing a few feature requests:
    • hicPlotMatrix: TADs can be visualized with hicPlotMatrix

    • hicAdjustMatrix is able to remove intra- or inter-chromosomal contacts (#664, #704)

    • hicValidateLocations: An option to validate TADs and to use additional data stored in cool matrices

    • hicPCA: Adding a function to select which eigenvector should be used for the output (#669)

    • hicConvertFormat: Adding the function to export hicpro file format and to import 2D-text files.

    • hicFindRestSites: Support of multiple restriction cut sequences (#659)

    • hicPlotMatrix: Option for loop locations spanning more than one bin to define if the start, center or end should be used for plotting (#640)

    • hicInterIntraTAD: A new script to compute the ratio of inter and intra TAD. ($404)

    • hicAggregateContacts: Option to consider the strand orientation (#633)

    • hicAverageRegions: Option to consider the strand orientation (#633)

    • hicCompareMatrices: An option to not normalize the matrices before the computation. (#677, #645) Thanks @lldelisle

    • hicDifferentialTAD: Adding rank sum statistics to the output (#728, #727). Thanks @dawe

    • hicPlotDistVsCounts: Adding a function to plot the counts vs distance in TAD regions. (#696) Thanks @lldelisle

  • Bug fixes:
    • hicCorrectMatrix: A bug that lead to wrong correction factors for the KR correction for cool files (#724)

    • hicDifferentialTAD: Solved multicore issue related to skipping data at the start and end of chromosomes (#725, #685)

    • hicHyperoptDetectLoops: Added an option to set if the chr prefix should be added or removed (#723)

    • hicPCA: Solving an issue if the region defined by the gene track is larger the region stored in the interaction matrix (#655, #710, #716, #719)

    • hicPCA: Fixing a bug where the masking of bins was automatically applied which lead to differing matrix dimensions for the e.g. the Pearson correlation matrices (#618)

    • hicBuildMatrix: Solving a bug if multiple restriction cut sites have the same dangling ends (#720)

    • hicBuildMatrix: Solving a bug that the parameter –removeSelfLigations was always set to true. Changed parameter name to –keepSelfLigations to keep the functionality. If the parameter is not set, the self ligations are removed.

    • hicBuildMatrix: If a region is specified, only the restrictionCutSite file information for that region is loaded to save memory (#646)

    • hicConvertFormat: Fixing a bug to copy the genome annotation information in the case of a cool to cool file conversion (#657)

    • hicCorrelate: Correcting the range of colors for the heatmap (#585)

    • hicCompartmentalization: Fixed index bug (#635, #637) Thanks @LeilyR

  • Updating hicBuildMatrix to be able to work with biopython versions > 1.77. Thanks @lldelisle (#691)

Release 3.6

10 November 2020

  • hicAggregateContacts, thanks @LeilyR:
    • hicAggragateContact has been updated to be able to handle inter chromosomal contacts as well as inter chromosomal contacts

    • Added scikit-learn to dependencies

  • hicBuildMatrix: Fixing another bug concerning scaffolds without any restriction enzyme cut site

  • Updated dependencies

  • Adding default values to the documentation. Thanks @lldelisle

  • hicTransform: Fixing a bug in case one of the intermediate matrices is empty

  • Official Python 3.8 support:
    • Manually setting ‘fork’ as the start method for multiprocessing because the default on macOS was set to ‘spawn’

Release 3.5.3

14 October 2020

  • Bug fix release:
    • Reads from scaffolds without any restriction enzyme cut site are considered as ‘same fragment’. An appearance of such a read will not lead to a crash anymore

  • Minor documentation improvements

Release 3.5.2

06 October 2020

  • Bug fix release:
    • enforcing version 15 of HiCMatrix. Version 14 had a bug concerning the application of the correction factors of cool files. See issue #595

    • Fixing a bug in hicDetectLoops in single-core mode. Thanks @alecw

    • Fixing a bug in hicDifferentialTAD concerning multiple chromosomes in the bed file. See issue #587

    • Updating dependencies to newest versions, except biopython. Forcing here <1.77 because their API change is breaking our source code. See issue #609

    • Fixing #596

  • Changing internal structure of the docs. Navigation should be better now.

Release 3.5.1

11 August 2020

  • patch hicCorrelate

  • hicBuildMatrix: Better help text

  • patch for hicPlotMatrix if matrix does not start at 0 and –region is used

  • bug fix for remove option in hicAdjustMatrix

Release 3.5

10 July 2020

  • Major update for hicDetectLoops: Results are now closer to HiCCUPS, it is faster and needs less memory.

  • hicHyperoptDetectLoops: New tool to compute the best parameter setting if the number of expected loops is given and a protein boundary file (e.g. CTCF on mammals) is provided

  • hicHyperoptDetectLoopsHiCCUPS: New tool to compute the best parameter setting for Juicers HiCCUPS. HiCCUPS and all its dependencies are not part of HiCExplorer and must be provided by the user. Number of expected loops and a protein boundary file (e.g. CTCF on mammals) must be provided.

  • hicMergeDomains: New tool to compute a merge of TADs computed on different resolutions. Moreover it provides a cleaning of the boundaries with the help of a protein peak file, and the hierarchical dependencies of TADs can be plotted. This tool is the result of the Bachelor thesis from Sarah Domogalla (@SarahMaria27). Thanks for your contribution!

  • hicDifferentialTAD: New tool to compute differential TADs between two samples

  • Bug fix for hicFindTADs: The format of the intermediate z-score matrices are now depending on the format of the input Hi-C matrix

  • Bug fix for chic*-modules: Fixate range is now correct applied.

  • chicSignificantInteractions, hicDetectLoops: Option to use a per relative position p-value with a p-value threshold file

  • Adding hicCreateThresholdFile: A script to generate a per position p-value threshold for hicDetectLoops and chicSignificantInteractions

  • Bug fix for hicPlotMatrix:
    • multiple bigwig tracks in the vertical setting are now supported

    • correct plot of bigwig if the given matrix does not start at the beginning of the chromosome

    • new parameters ‘increaseFigureWidth’ and ‘increaseFigureHeight’ to add more space to the plot if multiple bigwigs are plotted. Adjust this parameter to correct the plot of the heatmap which may be not quadratic anymore.

    • restriction of the loop regions to the user given range. This effects especially SVGs that will now contain less objects as before.

  • New feature for hicBuildMatrix:
    • multiple restriction cut sequences, dangling ends and restriction cut sites files are now supported

    • restriction cut sequences, dangling ends and restriction cut sites files are now mandatory parameters. This is now enforced to guarantee a correct detection of self ligations and self circles

  • hicPrepareQCreport: New support for multiple dangling ends

  • hicQuickQC: restriction cut sequences, dangling ends and restriction cut sites files are now mandatory parameters

  • hicFindRestSite: gz support for fasta file

  • Add fallback modes to multiple scripts if the parallelization fails.

  • hicAggregate: interactions between two bed files by comparing every row in the first bed file with its corresponding row in the second file. (issue #390)

  • hicAdjustMatrix: fix #341, 454
    • fixed –action remove: it actually remove the part from the matrix previously was masking it

    • the case where the end of the chromosome need to be removed.

  • New Azure testing for the general test cases, the trivial ones run on travis

Publication

02 July 2020

Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization Joachim Wolff, Leily Rabbani, Ralf Gilsbach, Gautier Richard, Thomas Manke, Rolf Backofen, Björn A Grüning Nucleic Acids Research, Volume 48, Issue W1, 02 July 2020, Pages W177–W184

Release 3.4.3

23 March 2020

  • Fixing the wrong p-value computation in for chicViewpoint. New method is more accurate for floating points.

  • Fixing a bug in chicViewpointBackgroundModel and chicQualityControl if an non-existing reference point was used.

  • Improving all chic* modules with a capturing of errors in sub-processes. It is now guaranteed that the computation will terminate. Either successful or by error.

  • Add option ‘truncateZero’ to chicViewpointBackgroundModel: This removes all zero values for the distributions before fitting to fight over dispersion.

  • Add option ‘decimalPlaces’ to chicViewpoint to adjust the decimal places in the output for all floating values. Helpful for really small p-values

  • Add option ‘truncateZeroPvalues’ to chicSignificantInteractions to set all p-values which are 0 to 1 and are therefore ignored.

  • Add option ‘truncateZeroPvalues’ to chicPlotViewpoint to set all p-values which are 0 to 1 and do not disturb the presentation of real p-values

Release 3.4.2

7 March 2020

  • This release fixes the wrong name scheme which was used in the chicModules. The most .bed files are now .txt files.

  • hicDetectLoops got an inner chromosome parallelization to decrease the compute time.

  • hicPlotMatrix got three new parameters: rotationX, rotationY and fontSize to adjust the position and font size of the labels. We hope this can lead in certain cases to a a better readability

  • hicPlotMatrix: fixed a bug that occurred if the list of chromosomes was given and the last chromosome appeared as an additional label.

  • Improving and updating the documentation.

Preprint

6 March 2020

The preprint of the loop detection algorithm is online via biorXiv: https://www.biorxiv.org/content/10.1101/2020.03.05.979096v1

Release 3.4.1

3 February 2020

  • This release fixes a bug in chicViewpoint that caused a crash if the data to be averaged is less than the window size.

Release 3.4

28 January 2020

  • Fixing a bug in hicAdjustMatrix: keep option had a bug concerning the cutting before the end of a chromosome or the start position was not from the beginning of the chromosome

  • hicCompartmentPolarization was renamed to hicCompartmentalization and got some bug fixes

  • Extending the option on how the observed vs. Expected matrix is computed and adding the parameter –ligation_factor to achieve a rescale behaviour of the values as it is implemented in Homer. The same changes are applied to hicTransform

  • Improved the documentation

  • Adding an option in hicAverageRegions to select start, end, center or start_end as start index for up/downstream range.

  • hicBuildMatrix: Removed default value of binSize to enable mutually exclusive group error if not one of them is set. Behaviour so far was that the binSize was taken.

  • hicPlotSVL: adding xlegend to plot of SVL ratios to indicate the data points per boxplots are the chromosome ratios

  • hicQuickQC: Removed binSize option of hicQuickQC because it does not matter for QC calculation and adding a sentence to recommend the usage of restriction enzyme and dangling end sequence. Fixing bug issue #464

  • hicNormalize: Adding option in hicNormalize to remove values after the normalization if values are smaller than a given threshold

  • Capture Hi-C modules: Change background model distribution assumption from negative binomial to continuous negative binomial by using Gamma functions as a replacement for the binomial coefficient. Source: https://stats.stackexchange.com/questions/310676/continuous-generalization-of-the-negative-binomial-distribution/311927

  • hicInfo: Implementing feature request #456. The length of chromosomes is now show in the information too

Release 3.3.1

15 November 2019

  • Fixing a bug in the labeling of chicPlotViewpoints if the value range is counted in MB

  • Add an option to chicViewpoint to pre-compute a x-fold of p-value over the maximum value of the relative distance

Release 3.3

8 October 2019

  • Fixing many bugs:
    • A bug in hicDetectLoops if a sub-matrix was very small

    • A bug in hicPlotMatrix if the region defined by –region was only a chromosome and loops should be plotted too

    • A bug in hicPlotMatrix if a loop region should be plotted and chromosomeOrder argument was used too

    • A bug in hicAggregateContacts (issue #405) if chromosomes were present in the matrix but not in the bed file

    • A bug in hicAdjustMatrix concerning a bed file and consecutive regions, see issue #343

    • A bug in hicAdjustMatrix if a chromosome is present in the matrix but not in the bed file, see issue #397

    • A bug in hicCompartmentsPolarization concerning the arguments ‘quantile’ and ‘outliers’ were interpreted as strings but should be integers

    • A bug in hicAdjustMatrix concerning the ‘keep’ option and how matrices are reordered internally. Thanks @LeilyR

  • Added features as requested:
    • hicPCA ignores now masked bins, see issue #342

    • chicPlotViewpoint:
      • Better legend handling on x-axis

      • Peaks are now display with their fill width

      • Add option –pValueSignificantLevels to categorize the p-values in x levels (e.g. 0.001 0.05 0.1)

    • chicViewpoint:
      • adding sorting via viewpoints and not by samples option (–allViewpointsList)

    • Adding an option to hicNormalize to normalize via multiplication and a use defined value (see issues #385, #424)

  • Rearrange hicAdjustMatrix to have a better accessibility to its functions from outside of main

  • Improving the documentation and fixing grammar / spelling mistakes. Thanks @simonbray

  • New script: hicPlotSVL to investigate short range vs long range ratios.

Release 3.2

** 22 August 2019**

  • Adding the new captured Hi-C module. Viewpoint analysis based on a background model, significant interaction detection and differential analysis are provided.

  • Adding documentation for captured Hi-C module and a tutorial on how to use it.

  • Adding a module to be able to detect quite fast the quality of a Hi-C data set: hicQuickQC.

  • Adding a tool to merge loops of different resolutions.

  • Improving validation of locations: Presorting is no longer necessary; adding feature to add ‘chr’ prefix to loop or protein chromosome name

  • Change loop detection slightly to improve results and fixed bugs:
    • preselection p-value was ignored and only p-value was used

    • adding additional test to the peak region test to decrease false discoveries

    • exchanging pThreshold / ln(distance) to remove too low values by a share of the maximum value of the distance. New parameter ‘maximumInteractionPercentageThreshold’

  • Removal of the folder ‘scripts’ and its content. These were outdated scripts and will maybe part of regular Hi-C tools in the future.

Release 3.1

9 July 2019

  • KR correction improvements: It is now able to process larger data sets like GM12878 primary+replicate on 10kb resolution.

  • Adding script for validation of loop locations with protein peak locations

  • Adding script hicCompartmentsPolarization: Rearrange the average interaction frequencies using the first PC values to represent the global compartmentalization signal

Release 3.0.2

28 June 2019

  • Pinning dependencies to:

    • hicmatrix version 9: API changes in version 10

    • krbalancing version 0.0.4: API changes in version 0.0.5

    • matplotlib version 3.0: Version 3.1 raises ‘Not implemented error’ for unknown reasons.

  • Set fit_nbinom to version 1.1: Version 1.0 Had deprecated function call of scipy > 1.2.

  • Small documentation fixes and improvements.

Release 3.0.1

5 April 2019

  • Fixes KR balancing correction factors

  • Deactivates log.debug

Release 3.0

3 April 2019

  • Python 3 only. Python 2.X is no longer supported

  • Additional Hi-C interaction matrix correction algorithm ‘Knight-Ruiz’ as a C++ module for a faster runtime and less memory usage.

  • Enriched regions detection tool: ‘hicDetectLoops’ based on strict candidate selection, ‘hicFindEnrichedContacts’ was deleted

  • Metadata for cooler files is supported: hicBuildMatrix and hicInfo are using it

  • New options for hicPlotMatrix: –loops to visualize computed loops from hicDetectLoops and –bigwigAdditionalVerticalAxis to display a bigwig track on the vertical axis too.

Release 2.2.3

22 March 2019

  • This bug fix release patches an issue with cooler files, hicBuildMatrix and the usage of a restriction sequence file instead of fixed bin size.

Release 2.2.2

27 February 2019

  • This bug fix release removes reference to hicExport that were forgotten to delete in 2.2. Thanks @BioGeek for this contribution.

Release 2.2.1

7 February 2019

  • Muting log output of matplotlib and cooler

  • Set version number of hicmatrix to 7

  • Optional parameter for hicInfo to write the result to a file instead to the bash

Release 2.2

18 January 2019

This release contains:

  • replaced hicExport by hicConvertFormat and hicAdjustMatrix

  • extended functionality for hicConvertFormat

    • read support for homer, hicpro, cool, h5

    • write support for h5, homer, cool

    • convert hic to cool

    • creation of mcool matrices

  • hicAdjustMatrix

    • remove, keep or mask specified regions from a file, or chromosomes

  • hicNormalize

    • normalize matrices to 0 - 1 range or to the read coverage of the lowest given

  • hicBuildMatrix

    • support for build mcool

  • restructuring the central class HiCMatrix to object oriented model and moved to its own library: deeptools/HiCMatrix.

    • Extended read / write support for file formats

    • better (faster, less memory) support for cool format

    • remove of old, unused code

    • restrict support to h5 and cool matrices, except hicConvertFormat

  • hicFindTADs: Option to run computation per specified chromosomes

  • hicPlotTADs: removed code and calls pyGenomeTracks

  • hicAverageRegions: Sum up in a given range around defined reference points. Useful to detect changes in TAD structures between different samples.

  • hicPlotAverageRegions: Plots such a average region

  • hicTransform: Restructuring the source code, remove of option ‘all’ because it was generating confusion. Adding option ‘exp_obs’, exp_obs_norm and exp_obs_lieberman. These three different options use different expectation matrix computations.

  • hicPCA

    • Adding –norm option to compute the expected matrix in the way HOMER is doing it. Useful for drosophila genomes

    • Adding option to write out the intermediate matrices ‘obs_exp’ and ‘pearson’ which are necessary in the computation of the PCA

  • hicPlotMatrix

    • Add option to clip bigwig values

    • Add option to scale bigwig values

  • Removed hicLog2Ration, functionality is covered by hicCompareMatrices

  • Extending test cases to cover more source code and be hopefully more stable.

  • Many small bugfixes

Publication

13 June 2018

We are proud to announce our latest publication:

Joachim Wolff, Vivek Bhardwaj, Stephan Nothjunge, Gautier Richard, Gina Renschler, Ralf Gilsbach, Thomas Manke, Rolf Backofen, Fidel Ramírez, Björn A Grüning. “Galaxy HiCExplorer: a web server for reproducible Hi-C data analysis, quality control and visualization”, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W11–W16, doi: https://doi.org/10.1093/nar/gky504

Release 2.1.4

25 May 2018

  • cooler file format correction factors are applied as they should be

  • parameter ‘–region’ of hicBuildMatrix works with Python 3

Release 2.1.3

7 May 2018

The third bugfix release of version 2.1 corrects an error in hicPlotViewpoint. It adds a feature requested in issue #169 which should have been included in release 2.1 but was accidentally not.

From 2.1 release note: hicPlotViewpoint: Adds a feature to plot multiple matrices in one image

Release 2.1.2

26 April 2018

The second bug fix release of 2.1 includes:

  • documentation improvements

  • fixing broken Readthedocs documentation

  • Small bug fix concerning hicPlotMatrix and cooler: –chromosomeOrder is now possible with more than one chromosome

  • Small fixes concerning updated dependencies: Fixing version number a bit more specific and not that strict in test cases delta values.

Release 2.1.1

27 March 2018

This release fixes a problem related to python3 in which chromosome names were of bytes type

Release 2.1

5 March 2018

The 2.1 version of HiCExplorer comes with new features and bugfixes.

  • Adding the new feature hicAggregateContacts: A tool that allows plotting of aggregated Hi-C sub-matrices of a specified list of positions.

  • Many improvements to the documentation and the help text. Thanks to Gina Renschler and Gautier Richard from the MPI-IE Freiburg, Germany.

  • hicPlotMatrix

    • supports only bigwig files for an additional data track.

    • the argument –pca was renamed to –bigwig

    • Smoothing the bigwig values to neighboring bins if no data is present there

    • Fixes to a bug concerning a crash of tight_layout

    • Adding the possibility to flip the sign of the values of the bigwig track

    • Adding the possibility to scale the values of the bigwig track

  • hicPlotViewpoint: Adds a feature to plot multiple matrices in one image

  • cooler file format

    • supports mcool files

    • applies correction factors if present

    • optionally reads bin[‘weight’]

  • fixes

    • a crash in hicPlotTads if horizontal lines were used

    • checks if all characters of a title are ASCII. If not they are converted to the closest looking one.

  • Updated and fixate version number of the dependencies

Release 2.0

December 21, 2017

This release makes HiCExplorer ready for the future:

  • Python 3 support

  • Cooler file format support

  • A/B comparment analysis

  • Improved visualizations

  • bug fixes for --perChr option in hicPlotMatrix

  • eigenvector track with --pca for hicPlotMatrix

  • visualization of interactions around a reference point or region with hicPlotViewpoint

  • Higher test coverage

  • re-licensing from GPLv2 to GPLv3

Release 1.8.1

November 27, 2017

Bug fix release:

  • a fix concerning the handling chimeric alignments in hicBuildMatrix. Thanks to Aleksander Jankowski @ajank

  • handling of dangling ends was too strict

  • improved help message in hicBuildMatrix

Release 1.8

October 25, 2017

This release is adding new features and fixes many bugs:

  • hicBuildMatrix: Added multicore support, new parameters –threads and –inputBufferSize

  • hicFindTADs:

  • One call instead of two: hicFindTADs TAD_score and hicFindTADs find_TADs merged to hicFindTADs.

  • New multiple correction method supported: False discovery rate. Call it with –correctForMultipleTesting fdr and –threshold 0.05.

  • Update of the tutorial: mES-HiC analysis.

  • Additional test cases and docstrings to improve the software quality

  • Fixed a bug occurring with bigwig files with frequent NaN values which resulted in only NaN averages

  • hicPlotTADs: Support for plotting points

  • Moved galaxy wrappers to https://github.com/galaxyproject/tools-iuc

  • Fixed multiple bugs with saving matrices

  • hicCorrelate: Changes direction of dendograms to left

Release 1.7.2

April 3, 2017

  • Added option to plot bigwig files as a line hicPlotTADs

  • Updated documentation

  • Improved hicPlotMatrix –region output

  • Added compressed matrices. In our tests the compressed matrices are significantly smaller.

March 28, 2017

Release 1.7

March 28, 2017

This release adds a quality control module to check the results from hicBuildMatrix. By default, now hicBuildMatrix generates a HTML page containing the plots from the QC measures. The results from several runs of hicBuildMatrix can be combined in one page using the new tool hicQC.

Also, this release added a module called hicCompareMatrices that takes two Hi-C matrices and computes the difference, the ratio or the log2 ratio. The resulting matrix can be plotted with hicPlotMatrix to visualize the changes.

Preprint introducing HiCExplorer is now online

March 8, 2017

Our #biorXiv preprint on DNA sequences behind Fly genome architecture is online!

Read the article here : http://biorxiv.org/content/early/2017/03/08/115063

In this article, we introduce HiCExplorer : Our easy to use tool for Hi-C data analysis, also available in Galaxy.

We also introduce HiCBrowser : A standalone software to visualize Hi-C along with other genomic datasets.

Based on HiCExplorer and HiCBrowser, we built a useful resource for anyone to browse and download the chromosome conformation datasets in Human, Mouse and Flies. It’s called the chorogenome navigator

Along with these resources, we present an analysis of DNA sequences behind 3D genome of Flies. Using high-resolution Hi-C analysis, we find a set of DNA motifs that characterize TAD boundaries in Flies and show the importance of these motifs in genome organization.

We hope that these resources and analysis would be useful for the community and welcome any feedback.

HiCExplorer wins best poster prize at VizBi2016

March 20, 2016

We are excited to announce that HiCExplorer has won the NVIDIA Award for Best Scientific Poster in VizBi2016, the international conference on visualization of biological data.

Read more here

This was our poster :

HiCExplorer

Citation

Please cite HiCExplorer as follows:

Joachim Wolff, Leily Rabbani, Ralf Gilsbach, Gautier Richard, Thomas Manke, Rolf Backofen, Björn A Grüning. Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization, Nucleic Acids Research, Nucleic Acids Research, Volume 48, Issue W1, 02 July 2020, Pages W177–W184, https://doi.org/10.1093/nar/gkaa220

Joachim Wolff, Vivek Bhardwaj, Stephan Nothjunge, Gautier Richard, Gina Renschler, Ralf Gilsbach, Thomas Manke, Rolf Backofen, Fidel Ramírez, Björn A Grüning. Galaxy HiCExplorer: a web server for reproducible Hi-C data analysis, quality control and visualization, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W11–W16, doi: https://doi.org/10.1093/nar/gky504

Fidel Ramirez, Vivek Bhardwaj, Jose Villaveces, Laura Arrigoni, Bjoern A Gruening,Kin Chung Lam, Bianca Habermann, Asifa Akhtar, Thomas Manke. “High-resolution TADs reveal DNA sequences underlying genome organization in flies”. Nature Communications, Volume 9, Article number: 189 (2018), doi: https://doi.org/10.1038/s41467-017-02525-w

Citation

Please cite HiCExplorer as follows:

Joachim Wolff, Leily Rabbani, Ralf Gilsbach, Gautier Richard, Thomas Manke, Rolf Backofen, Björn A Grüning. Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization, Nucleic Acids Research, Nucleic Acids Research, Volume 48, Issue W1, 02 July 2020, Pages W177–W184, https://doi.org/10.1093/nar/gkaa220

Joachim Wolff, Vivek Bhardwaj, Stephan Nothjunge, Gautier Richard, Gina Renschler, Ralf Gilsbach, Thomas Manke, Rolf Backofen, Fidel Ramírez, Björn A Grüning. Galaxy HiCExplorer: a web server for reproducible Hi-C data analysis, quality control and visualization, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W11–W16, doi: https://doi.org/10.1093/nar/gky504

Fidel Ramirez, Vivek Bhardwaj, Jose Villaveces, Laura Arrigoni, Bjoern A Gruening,Kin Chung Lam, Bianca Habermann, Asifa Akhtar, Thomas Manke. “High-resolution TADs reveal DNA sequences underlying genome organization in flies”. Nature Communications, Volume 9, Article number: 189 (2018), doi: https://doi.org/10.1038/s41467-017-02525-w

_images/logo_mpi-ie.jpg

This tool suite is developed by the Bioinformatics Unit at the Max Planck Institute for Immunobiology and Epigenetics, Freiburg and by the Bioinformatics Lab of the Albert-Ludwigs-University Freiburg, Germany.