Using alignments from a program that supports local alignment (e.g. Bowtie2), where both paired-end (PE) reads are mapped with the --local option, this program reads those files and creates a matrix of interactions.
usage: hicBuildMatrix [-h] --samFiles two sam files two sam files [--outBam bam file] (--binSize BINSIZE | --restrictionCutFile BED file) [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE] [--restrictionSequence RESTRICTIONSEQUENCE] [--danglingSequence DANGLINGSEQUENCE] --outFileName FILENAME --QCfolder FOLDER [--region CHR:START-END] [--keepSelfCircles] [--minMappingQuality MINMAPPINGQUALITY] [--threads THREADS] [--inputBufferSize INPUTBUFFERSIZE] [--doTestRun] [--skipDuplicationCheck] [--version]
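As a rough illustration of how the usage line fits together, the sketch below assembles an argument list in Python. It only builds and prints the command; it does not run the tool. The helper name `build_command`, the input and output file names, and the folder name are hypothetical placeholders. The only rule it encodes is the one visible in the usage line: `--binSize` and `--restrictionCutFile` are alternatives, so exactly one of them is passed.

```python
import shlex


def build_command(sam_files, out_file, qc_folder, bin_size=None,
                  restriction_cut_file=None, extra=()):
    """Assemble a hicBuildMatrix argument list (hypothetical helper)."""
    if len(sam_files) != 2:
        raise ValueError("--samFiles expects exactly two SAM files")
    # The usage line groups --binSize and --restrictionCutFile as
    # alternatives, so exactly one of them must be supplied here.
    if (bin_size is None) == (restriction_cut_file is None):
        raise ValueError("give exactly one of bin_size or restriction_cut_file")
    cmd = ["hicBuildMatrix", "--samFiles", *sam_files]
    if bin_size is not None:
        cmd += ["--binSize", str(bin_size)]
    else:
        cmd += ["--restrictionCutFile", restriction_cut_file]
    cmd += ["--outFileName", out_file, "--QCfolder", qc_folder, *extra]
    return cmd


# Placeholder file names; --doTestRun is appended as an extra flag.
cmd = build_command(["mate_R1.sam", "mate_R2.sam"], "matrix.h5", "qc",
                    bin_size=10000, extra=["--doTestRun"])
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where the tool is installed.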
- optional arguments
--samFiles, -s
    The two alignment SAM files to process.

--outBam, -b
    Output BAM file. Optional parameter. A BAM file containing all valid Hi-C reads can be created using this option. This BAM file can be useful for inspecting the distribution of valid Hi-C read pairs or for other downstream analyses, but it is not used by any HiCExplorer tool. Computation will take significantly longer if this option is set.

--binSize=10000, -bs=10000
    Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. Alternatively, the location of the restriction sites can be given (see --restrictionCutFile).

--restrictionCutFile, -rs
    BED file with all restriction cut places (output of the “findRestSite” command). Should contain only mappable restriction sites. If given, the bins are set to match the restriction fragments (i.e. the region between one restriction site and the next).

--minDistance=300
    Minimum distance between restriction sites. Restriction sites that are closer than this distance are merged into one. This option only applies if --restrictionCutFile is given.

--maxDistance=800
    Maximum distance (in bp) from a restriction site to a read for the read to be considered valid. This option only applies if --restrictionCutFile is given.

--restrictionSequence, -seq
    Sequence of the restriction site.

--danglingSequence
    Dangling end sequence left by the restriction enzyme. For DpnII, for example, the dangling end is the same as the restriction sequence. This is used to discard reads that start or end with this sequence and are therefore considered un-ligated fragments or “dangling ends”. If not given, these statistics will not be available.

--outFileName, -o
    Output file name for the Hi-C matrix.

--QCfolder
    Path of the folder in which to save the quality control data for the matrix.

--region, -r
    Region of the genome to limit the operation to. The format is chr:start-end. Specifying only a chromosome is also valid, for example --region chr10.

--keepSelfCircles=False
    If set, outward facing reads without any restriction fragment (self circles) are kept. They will be counted and shown in the QC plots.

--minMappingQuality=15
    Minimum mapping quality for reads to be accepted. Because the restriction enzyme site could be located on top of the read, this may reduce the reported quality of the read. Thus, this parameter may be adjusted if too many low quality (but otherwise perfectly valid) Hi-C reads are found. A good strategy is to make a test run (using --doTestRun), check the results to see whether too many low quality reads are present, and then use the generated BAM file to check whether those low quality reads are caused by the read not being mapped entirely.

--threads=4
    Number of threads. Uses the Python multiprocessing module. One master process reads the input files into the buffer, and one process merges the output BAM files of the other processes into one output BAM file. All other threads do the actual computation. The minimum value for --threads is 2. Using 8 threads is optimal if you have an HDD. A higher number of threads is only useful if you have a fast SSD. Keep in mind that the performance of hicBuildMatrix is influenced by the number of threads, the speed of your hard drive and the inputBufferSize. To clarify: a higher thread number does not hurt performance, but it does not improve it either. With a slow HDD and a high number of threads, many threads will be idle most of the time.

--inputBufferSize=400000
    Size of the input buffer of each thread. The default value is 400,000 read pairs per input file per thread. Reduce the value to decrease memory usage.

--doTestRun=False
    A test run is useful to test the quality of a Hi-C experiment quickly. It works by testing only 1,000,000 reads. This option is useful to get an idea of quality control values such as inter-chromosomal interactions, duplication rates etc.

--skipDuplicationCheck=False
    Identification of duplicated read pairs is memory consuming. Thus, in case of memory errors this check can be skipped. However, consider running a `--doTestRun` first to get an estimate of the duplicated reads.

--version
    Show the program’s version number and exit.
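To make the relationship between --threads and --inputBufferSize concrete, here is a small arithmetic sketch. It reads the option text literally (one buffer of inputBufferSize entries per input file per thread) and treats the result as an upper bound, since the description does not say whether the master and merging processes hold buffers of their own. The function name `buffered_entries` is hypothetical, and the sketch says nothing about bytes of memory, only about the number of buffered entries.

```python
def buffered_entries(threads=4, input_buffer_size=400_000, input_files=2):
    """Upper bound on buffered entries, assuming one buffer of
    input_buffer_size entries per input file per thread."""
    if threads < 2:
        # The option text gives 2 as the minimum value for --threads.
        raise ValueError("--threads must be at least 2")
    return threads * input_files * input_buffer_size


# Compare the default settings with more threads and a smaller buffer.
for threads, size in [(4, 400_000), (8, 400_000), (8, 200_000)]:
    total = buffered_entries(threads, size)
    print(f"threads={threads} inputBufferSize={size} -> {total:,} entries")
```

Under this reading, doubling the thread count while halving the buffer size keeps the number of buffered entries unchanged, which is one way to trade the two settings against each other when memory is tight.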