hicBuildMatrix

Using an alignment from a program that supports local alignment (eg. Bowtie2) where both PE reads are mapped using the –local option, this program reads such file and creates a matrix of interactions.

usage: hicBuildMatrix --samFiles two sam files two sam files --outFileName
                      FILENAME --QCfolder FOLDER --restrictionCutFile BED file
                      [BED file ...] --restrictionSequence RESTRICTIONSEQUENCE
                      [RESTRICTIONSEQUENCE ...] --danglingSequence
                      DANGLINGSEQUENCE [DANGLINGSEQUENCE ...]
                      [--outBam bam file] [--binSize BINSIZE [BINSIZE ...]]
                      [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE]
                      [--maxLibraryInsertSize MAXLIBRARYINSERTSIZE]
                      [--genomeAssembly GENOMEASSEMBLY]
                      [--region CHR:START-END] [--keepSelfLigation]
                      [--keepSelfCircles]
                      [--minMappingQuality MINMAPPINGQUALITY]
                      [--threads THREADS] [--inputBufferSize INPUTBUFFERSIZE]
                      [--doTestRun] [--doTestRunLines DOTESTRUNLINES]
                      [--skipDuplicationCheck] [--chromosomeSizes txt file]
                      [--help] [--version]

Required arguments

--samFiles, -s

The two PE alignment sam files to process

--outFileName, -o

Output file name for the Hi-C matrix.

--QCfolder

Path of folder to save the quality control data for the matrix. The log files produced this way can be loaded into hicQC in order to compare the quality of multiple Hi-C libraries.

--restrictionCutFile, -rs

BED file(s) with all restriction cut sites (output of “hicFindRestSite” command). Should only contain the restriction sites of the same genome which has been used to generate the input sam files. Using regions of a different genome version can generate false results! To use more than one restriction enzyme, generate a restrictionCutFile for each enzyne and list them space seperated.

--restrictionSequence, -seq

Sequence of the restriction site, if multiple are used, please list them space seperated. If a dangling sequence is listed at the same time, please preserve the same order.

--danglingSequence

Sequence left by the restriction enzyme after cutting, if multiple are used, please list them space seperated and preserve the order. Each restriction enzyme recognizes a different DNA sequence and, after cutting, they leave behind a specific “sticky” end or dangling end sequence. For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT. For DpnII, the restriction site and dangling end sequence are the same: GATC. This information is easily found on the description of the restriction enzyme. The dangling sequence is used to classify and report reads whose 5’ end starts with such sequence as dangling-end reads. A significant portion of dangling-end reads in a sample are indicative of a problem with the re-ligation step of the protocol.

Optional arguments

--outBam, -b

Output bam file to process. Optional parameter. A bam file containing all valid Hi-C reads can be created using this option. This bam file could be useful to inspect the distribution of valid Hi-C reads pairs or for other downstream analyses, but is not used by any HiCExplorer tool. Computation will be significantly longer if this option is set.

--binSize, -bs

Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. If not given, matrices of restriction site resolution will be built. Optionally for mcool file format: Define multiple resolutions which are all a multiple of the first value. Example: –binSize 10000 20000 50000 will create a mcool file formate containing the three defined resolutions.

--minDistance

Minimum distance between restriction sites. Restriction sites that are closer than this distance are merged into one. This option only applies if –restrictionCutFile is given (Default: 300).

Default: 300

--maxDistance

This parameter is now obsolete. Use –maxLibraryInsertSize instead

--maxLibraryInsertSize

The maximum library insert size defines different cut offs based on the maximum expected library size. This is not the average fragment size but the higher end of the the fragment size distribution (obtained using for example a Fragment Analyzer or a Bioanalyzer) which usually is between 800 to 1500 bp. If this value is not known use the default value. The insert value is used to decide if two mates belong to the same fragment (by checking if they are within this max insert size) and to decide if a mate is too far away from the nearest restriction site (Default: 1000).

Default: 1000

--genomeAssembly, -ga

The genome the reads were mapped to. Used for metadata of cool file.

--region, -r

Region of the genome to limit the operation to. The format is chr:start-end. It is also possible to just specify a chromosome, for example –region chr10

--keepSelfLigation

If set, inward facing reads less than 1000 bp apart and having a restrictionsite in between are removed. Although this reads do not contribute to any distant contact, they are useful to account for bias in the data (for the moment is always True).

Default: False

--keepSelfCircles

If set, outward facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots.

Default: False

--minMappingQuality

minimum mapping quality for reads to be accepted. Because the restriction enzyme site could be located on top of the read, this may reduce the reported quality of the read. Thus, this parameter may be adjusted if too many low quality (but otherwise perfectly valid Hi-C reads) are found. A good strategy is to make a test run (using the –doTestRun), then checking the results to see if too many low quality reads are present and then using the bam file generated to check if those low quality reads are caused by the read not being mapped entirely (Default: 15).

Default: 15

--threads

Number of threads. Using the python multiprocessing module. One master process which is used to read the input file into the buffer and one process which is merging the output bam files of the processes into one output bam file. All other threads do the actual computation. Minimum value for the ‘–thread’ parameter is 2. The usage of 8 threads is optimal if you have an HDD. A higher number of threads is only useful if you have a fast SSD. Have in mind that the performance of hicBuildMatrix is influenced by the number of threads, the speed of your hard drive and the inputBufferSize. To clarify: the performance with a higher thread number is not negative influenced but not positive too. With a slow HDD and a high number of threads many threads will do nothing most of the time (Default: 4).

Default: 4

--inputBufferSize

Size of the input buffer of each thread. 400,000 read pairs per input file per thread is the default value. Reduce this value to decrease memory usage.

Default: 400000

--doTestRun

A test run is useful to test the quality of a Hi-C experiment quickly. It works by testing only 1,000,000 reads. This option is useful to get an idea of quality control values like inter-chromosomal interactions, duplication rates etc.

Default: False

--doTestRunLines

Number of lines to consider for the qc test run (Default: 1000000).

Default: 1000000

--skipDuplicationCheck

Identification of duplicated read pairs is memory consuming. Thus, in case of memory errors this check can be skipped. However, consider running a –doTestRun first to get an estimation of the duplicated reads.

Default: False

--chromosomeSizes, -cs

File with the chromosome sizes for your genome. A tab-delimited two column layout “chr_name size” is expectedUsually the sizes can be determined from the SAM/BAM input files, however, for cHi-C or scHi-C it can be that at the start or end no data is present. Please consider that this option causes that only reads are considered which are on the listed chromosomes.Use this option to guarantee fixed sizes. An example file is available via UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/dm3.chrom.sizes

--version

show program’s version number and exit

Please note that the file type extension for the output matrix (--outFileName) must be given! This can be .h5, .cool or the specializations of cool .mcool, if the path is given! For .scool files please create one .cool file per cell and merge it together with scHiCExplorer’s scHicMergeToSCool.

Building multicooler matrices

hicBuildMatrix supports building multicooler matrices which are for example needed for visualization with HiGlass. To do so, use as out file format either .cool or .mcool and define the desired resolutions as –binSize. hicBuildMatrix builds the interaction matrix for the highest resolution and merges the bins for the lower resolutions. The lower resolutions need to be an integer multiplicative of the highest resolution.

$ hicBuildMatrix -s forward.bam reverse.bam -o multi_resolution.cool
  --binSize 10000 20000 50000 100000 --QCfolder QC

Introducing with version 3.5 we support multiple restriction and dangling end sequences, and multiple restriction cut site files. Hi-C protocols that use multiple restriction cut enzymes benefit from this and get now an improved QC report. Version 3.5 adds also the support for a chromosome size file which can help to get interaction matrices with a predefined size. Capture Hi-C or single-cell Hi-C data, where it is not guaranteed that reads from all areas of the chromosome are present benefit from this latest improvement.

Missing scaffolds or contigs in the Hi-C matrix

Restriction enzymes cut the DNA at their specific restriction cut site sequences. It can occur for scaffolds or contigs (less likely it happens for chromosomes) that it does not contain any of these cut sites. In this case, the reads from this scaffold or contig are considered as invalid and are part of the ‘same restriction fragment’-statistics in the QC report.