Using an alignment from a program that supports local alignment (e.g. Bowtie2), where both PE reads are mapped using the --local option, this program reads such a file and creates a matrix of interactions.
usage: hicBuildMatrix --samFiles two sam files two sam files --outFileName FILENAME --QCfolder FOLDER [--outBam bam file] (--binSize BINSIZE | --restrictionCutFile BED file) [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE] [--maxLibraryInsertSize MAXLIBRARYINSERTSIZE] [--restrictionSequence RESTRICTIONSEQUENCE] [--danglingSequence DANGLINGSEQUENCE] [--region CHR:START-END] [--keepSelfCircles] [--minMappingQuality MINMAPPINGQUALITY] [--threads THREADS] [--inputBufferSize INPUTBUFFERSIZE] [--doTestRun] [--skipDuplicationCheck] [--help] [--version]
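As an illustration of the usage line, the following Python sketch assembles the argument list for a run with a fixed bin size. The helper `build_command`, the file and folder names, the bin size of 10000 and the thread count are hypothetical placeholders; only the flags themselves are taken from the usage line. The command is only built and printed, not executed.

```python
import shlex


def build_command(sam_files, out_file, qc_folder, bin_size, threads=4):
    """Assemble a hicBuildMatrix argument list (hypothetical helper)."""
    if len(sam_files) != 2:
        raise ValueError("--samFiles expects exactly two files, one per mate")
    if threads < 2:
        raise ValueError("--threads must be at least 2")
    return [
        "hicBuildMatrix",
        "--samFiles", sam_files[0], sam_files[1],
        "--outFileName", out_file,
        "--QCfolder", qc_folder,
        "--binSize", str(bin_size),
        "--threads", str(threads),
    ]


# All names and values below are placeholders, not recommended settings.
cmd = build_command(["mate_R1.sam", "mate_R2.sam"], "hic_matrix.h5", "qc_out", 10000)
print(shlex.join(cmd))
```

To bin by restriction fragments instead, `--binSize` would be swapped for `--restrictionCutFile`, since the usage line lists the two as mutually exclusive.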
|--samFiles, -s||The two PE alignment sam files to process.|
|--outFileName||Output file name for the Hi-C matrix.|
|--QCfolder||Path of the folder in which to save the quality control data for the matrix. The log files produced this way can be loaded into hicQC in order to compare the quality of multiple Hi-C libraries.|
|--outBam, -b||Output bam file. Optional parameter. A bam file containing all valid Hi-C reads can be created using this option. This bam file could be useful to inspect the distribution of valid Hi-C read pairs or for other downstream analyses, but is not used by any HiCExplorer tool. Computation will be significantly longer if this option is set.|
|--binSize||Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. Alternatively, the location of the restriction sites can be given (see --restrictionCutFile).|
|--restrictionCutFile||BED file with all restriction cut places (output of the “findRestSite” command). Should contain only mappable restriction sites. If given, the bins are set to match the restriction fragments (i.e. the region between one restriction site and the next).|
|--minDistance||Minimum distance between restriction sites. Restriction sites that are closer than this distance are merged into one. This option only applies if --restrictionCutFile is given.|
|--maxDistance||This parameter is now obsolete. Use --maxLibraryInsertSize instead.|
|--maxLibraryInsertSize||The maximum library insert size defines different cut-offs based on the maximum expected library size. This is not the average fragment size but the higher end of the fragment size distribution (obtained using, for example, a Fragment Analyzer or a Bioanalyzer), which is usually between 800 and 1500 bp. If this value is not known, use the default of 1000. The insert size is used to decide whether two mates belong to the same fragment (by checking if they are within this maximum insert size) and whether a mate is too far away from the nearest restriction site.|
|--restrictionSequence||Sequence of the restriction site.|
|--danglingSequence||Sequence left by the restriction enzyme after cutting. Each restriction enzyme recognizes a different DNA sequence and, after cutting, leaves behind a specific “sticky” end or dangling-end sequence. For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT. For DpnII, the restriction site and dangling-end sequence are the same: GATC. This information is easily found in the description of the restriction enzyme. The dangling sequence is used to classify and report reads whose 5’ end starts with this sequence as dangling-end reads. A significant portion of dangling-end reads in a sample is indicative of a problem with the re-ligation step of the protocol.|
|--region, -r||Region of the genome to limit the operation to. The format is chr:start-end. It is also possible to specify just a chromosome, for example --region chr10.|
|--keepSelfCircles||If set, outward-facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots.|
|--minMappingQuality||Minimum mapping quality for reads to be accepted. Because the restriction enzyme site could be located on top of the read, the reported quality of the read may be reduced. Thus, this parameter may be adjusted if too many low-quality (but otherwise perfectly valid) Hi-C reads are found. A good strategy is to make a test run (using --doTestRun), check the results to see whether too many low-quality reads are present, and then use the generated bam file to check whether those low-quality reads are caused by the read not being mapped entirely.|
|--threads||Number of threads, using the Python multiprocessing module. One master process reads the input file into the buffer, and one process merges the output bam files of the worker processes into a single output bam file. All other threads do the actual computation. The minimum value for --threads is 2. Eight threads are optimal with an HDD; a higher number of threads is only useful with a fast SSD. Keep in mind that the performance of hicBuildMatrix is influenced by the number of threads, the speed of the hard drive and the --inputBufferSize. To clarify: a higher thread number does not hurt performance, but it does not necessarily improve it either. With a slow HDD and a high number of threads, many threads will be idle most of the time.|
|--inputBufferSize||Size of the input buffer of each thread. The default value is 400,000 read pairs per input file per thread. Reduce this value to decrease memory usage.|
|--doTestRun||A test run is useful to assess the quality of a Hi-C experiment quickly. It works by testing only 1,000,000 reads. This option is useful to get an idea of quality control values such as inter-chromosomal interactions, duplication rates, etc.|
|--skipDuplicationCheck||Identification of duplicated read pairs is memory-consuming. Thus, in case of memory errors, this check can be skipped. However, consider running a --doTestRun first to get an estimate of the duplicated reads.|
|--version||show program’s version number and exit|
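The dangling-sequence and maximum-library-insert-size descriptions above can be illustrated with a minimal Python sketch. This is not the code hicBuildMatrix uses: the function names, the made-up read sequences and positions, and the choice of an inclusive `<=` comparison are assumptions for illustration, and the sketch ignores strand and mate orientation.

```python
def is_dangling_end(read_seq, dangling_seq):
    """True if the 5' end of the read starts with the dangling sequence."""
    return read_seq.upper().startswith(dangling_seq.upper())


def within_insert_size(pos1, pos2, max_insert=1000):
    """True if two mate positions on the same chromosome lie within max_insert bp.

    The default of 1000 mirrors the default mentioned for --maxLibraryInsertSize;
    the inclusive comparison is an assumption of this sketch.
    """
    return abs(pos1 - pos2) <= max_insert


# DpnII leaves the dangling sequence GATC; the reads below are invented.
print(is_dangling_end("GATCTTAGGCA", "GATC"))  # True: read starts with GATC
print(is_dangling_end("TTAGGCAGATC", "GATC"))  # False: GATC is not at the 5' end

# Mates 600 bp apart could come from one fragment; mates 33 kb apart could not.
print(within_insert_size(15_000, 15_600))  # True
print(within_insert_size(15_000, 48_000))  # False
```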