Using an alignment from a program that supports local alignment (e.g. Bowtie2), where both PE reads are mapped using the --local option, this program reads the resulting files and creates a matrix of interactions.
usage: hicBuildMatrix [-h] --samFiles two sam files two sam files [--outBam bam file] (--binSize BINSIZE | --restrictionCutFile BED file) [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE] [--maxLibraryInsertSize MAXLIBRARYINSERTSIZE] [--restrictionSequence RESTRICTIONSEQUENCE] [--danglingSequence DANGLINGSEQUENCE] --outFileName FILENAME --QCfolder FOLDER [--region CHR:START-END] [--keepSelfCircles] [--minMappingQuality MINMAPPINGQUALITY] [--threads THREADS] [--inputBufferSize INPUTBUFFERSIZE] [--doTestRun] [--skipDuplicationCheck] [--version]
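As a rough illustration of the usage line, the sketch below assembles a minimal argument list in Python. It uses only flags that appear in the usage above; the file names, the folder name, the bin size and the thread count are hypothetical placeholders, and the helper function itself is not part of HiCExplorer.

```python
import shlex


def build_hicbuildmatrix_command(sam_mate1, sam_mate2, out_matrix, qc_folder,
                                 bin_size=10000, threads=4):
    """Assemble an argument list from flags listed in the usage line.

    All argument values are placeholders chosen for this example.
    """
    # The option reference states that --threads must be at least 2.
    if threads < 2:
        raise ValueError("--threads must be at least 2")
    return [
        "hicBuildMatrix",
        "--samFiles", sam_mate1, sam_mate2,  # one SAM file per mate
        "--binSize", str(bin_size),          # alternative: --restrictionCutFile
        "--outFileName", out_matrix,
        "--QCfolder", qc_folder,
        "--threads", str(threads),
    ]


# Hypothetical file names; replace them with your own paths.
cmd = build_hicbuildmatrix_command("mate_R1.sam", "mate_R2.sam",
                                   "matrix.h5", "qc_output")
print(shlex.join(cmd))
```

To actually run the command, the list could be passed to `subprocess.run(cmd, check=True)`, assuming hicBuildMatrix is installed and on the PATH.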
|--samFiles, -s||The two alignment SAM files to process|
|--outBam, -b||Optional output BAM file. A BAM file containing all valid Hi-C reads can be created using this option. This BAM file can be useful for inspecting the distribution of valid Hi-C read pairs or for other downstream analyses, but it is not used by any HiCExplorer tool. Computation takes significantly longer if this option is set.|
|--binSize||Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. Alternatively, the location of the restriction sites can be given (see --restrictionCutFile).|
|--restrictionCutFile||BED file with all restriction cut sites (output of the “findRestSite” command). Should contain only mappable restriction sites. If given, the bins are set to match the restriction fragments (i.e. the region between one restriction site and the next).|
|--minDistance||Minimum distance between restriction sites. Restriction sites that are closer than this distance are merged into one. This option only applies if --restrictionCutFile is given.|
|--maxDistance||This parameter is now obsolete. Use --maxLibraryInsertSize instead.|
|--maxLibraryInsertSize||The maximum library insert size defines different cut-offs based on the maximum expected library size. This is not the average fragment size but the upper end of the fragment size distribution (obtained, for example, with a Fragment Analyzer), which is usually between 800 and 1500 bp. If this value is not known, use the default of 1000. The insert size is used to decide whether two mates belong to the same fragment (by checking whether they are within this maximum insert size) and whether a mate is too far away from the nearest restriction site.|
|--restrictionSequence||Sequence of the restriction site.|
|--danglingSequence||Dangling end sequence left by the restriction enzyme. For DpnII, for example, the dangling end is the same as the restriction sequence. This is used to discard reads that start or end with this sequence and that are considered un-ligated fragments or “dangling-ends”. If not given, these statistics will not be available.|
|--outFileName||Output file name for the Hi-C matrix|
|--QCfolder||Path of the folder in which to save the quality control data for the matrix|
|--region, -r||Region of the genome to limit the operation to. The format is chr:start-end. Specifying only a chromosome is also valid, for example --region chr10|
|--keepSelfCircles||If set, outward-facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots.|
|--minMappingQuality||Minimum mapping quality for reads to be accepted. Because the restriction enzyme site could be located within the read, the reported mapping quality of the read may be reduced. This parameter may therefore need to be adjusted if too many low-quality (but otherwise perfectly valid) Hi-C reads are found. A good strategy is to make a test run (using --doTestRun), check the results to see whether too many low-quality reads are present, and then use the BAM file generated to check whether those low-quality reads are caused by the read not being mapped entirely.|
|--threads||Number of threads, using the Python multiprocessing module. One master process reads the input files into the buffer, and one process merges the output BAM files of the worker processes into a single output BAM file. All other threads do the actual computation. The minimum value for the --threads parameter is 2. Using 8 threads is optimal if you have an HDD; a higher number of threads is only useful if you have a fast SSD. Keep in mind that the performance of hicBuildMatrix is influenced by the number of threads, the speed of your hard drive and the inputBufferSize. To clarify: a higher thread count does not reduce performance, but it does not improve it either. With a slow HDD and a high number of threads, many threads will be idle most of the time.|
|--inputBufferSize||Size of the input buffer of each thread. The default value is 400,000 read pairs per input file per thread. Reduce this value to decrease memory usage.|
|--doTestRun||A test run is useful to test the quality of a Hi-C experiment quickly. It works by testing only 1,000,000 reads. This option is useful to get an idea of quality control values such as inter-chromosomal interactions, duplication rates, etc.|
|--skipDuplicationCheck||Identification of duplicated read pairs is memory-consuming. Thus, in case of memory errors, this check can be skipped. However, consider running with --doTestRun first to get an estimate of the number of duplicated reads.|
|--version||show program’s version number and exit|
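The --minDistance description says that restriction sites closer than the given distance are merged into one. The sketch below shows one plausible reading of that rule on a sorted list of positions: a site is dropped when it lies closer than the threshold to the previously kept site. Which coordinate survives a merge is an assumption made here for illustration; the function name and the example values are hypothetical and this is not HiCExplorer's implementation.

```python
def merge_close_sites(positions, min_distance):
    """Collapse restriction sites that lie closer than min_distance.

    Assumption: the first site of each close group is the one that is kept.
    """
    kept = []
    for pos in sorted(positions):
        # Keep the site only if it is far enough from the last kept site.
        if not kept or pos - kept[-1] >= min_distance:
            kept.append(pos)
    return kept


# Hypothetical cut positions on one chromosome (in bp).
sites = [100, 150, 900, 1000, 2500]
merged = merge_close_sites(sites, min_distance=300)
print(merged)
```

With these example values, 150 is merged into 100 and 1000 into 900, so three sites remain and the bins derived from them are correspondingly larger.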
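The --region option accepts either chr:start-end or a bare chromosome name. A minimal sketch of how such a string could be split into its parts is shown below; the function name and the choice to return None for missing coordinates are assumptions for illustration, not the tool's own parser.

```python
def parse_region(region):
    """Split 'chr:start-end' (or a bare chromosome name) into its parts.

    Returns (chromosome, start, end); start and end are None when only
    a chromosome is given.
    """
    chrom, sep, span = region.partition(":")
    if not sep:
        # Only a chromosome name was given, e.g. "chr10".
        return chrom, None, None
    start_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError(f"expected chr:start-end, got {region!r}")
    start, end = int(start_text), int(end_text)
    if start >= end:
        raise ValueError(f"start must be smaller than end in {region!r}")
    return chrom, start, end


whole_chromosome = parse_region("chr10")
sub_region = parse_region("chr2:1000-5000")
print(whole_chromosome, sub_region)
```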