hicQuickQC

Background

This tool considers the first 1,000,000 reads (or a user-defined number) of the mapped BAM files to get a quality estimate of the Hi-C data.

Description

hicQuickQC considers the first n lines of two BAM/SAM files to get a first estimate of the quality of the data. It is highly recommended to set the restriction enzyme and dangling-end parameters to get a meaningful quality report.
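The idea of sampling only the first n records can be sketched as follows. This is an illustrative Python sketch, not the HiCExplorer implementation: the function name first_n_pairs and the toy records are hypothetical, and it assumes plain-text SAM lines in which header lines start with '@' and both files list the mates in the same order.

```python
from itertools import islice


def first_n_pairs(sam_lines_1, sam_lines_2, n):
    """Return an iterator over up to n (mate1, mate2) alignment-line pairs."""
    # Drop SAM header lines, which start with '@'.
    records_1 = (line for line in sam_lines_1 if not line.startswith("@"))
    records_2 = (line for line in sam_lines_2 if not line.startswith("@"))
    # zip walks both inputs in lockstep; islice stops after n pairs,
    # so the rest of the input is never read.
    return islice(zip(records_1, records_2), n)


# Toy input (hypothetical records, SAM fields truncated for brevity).
mates_1 = ["@HD\tVN:1.6", "r1\t0\tchr1\t100", "r2\t0\tchr1\t500", "r3\t16\tchr2\t40"]
mates_2 = ["@HD\tVN:1.6", "r1\t16\tchr1\t900", "r2\t0\tchr3\t70", "r3\t0\tchr2\t300"]

sample = list(first_n_pairs(mates_1, mates_2, 2))
print(len(sample))  # 2
```

Because the iterator is lazy, the cost of such a sample depends on n rather than on the size of the full alignment files, which is what makes a quick first estimate cheap.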

usage: hicQuickQC --samFiles two sam files two sam files --QCfolder FOLDER
                  --restrictionCutFile BED file [BED file ...]
                  --restrictionSequence RESTRICTIONSEQUENCE
                  [RESTRICTIONSEQUENCE ...] --danglingSequence
                  DANGLINGSEQUENCE [DANGLINGSEQUENCE ...] [--lines LINES]
                  [--help] [--version]

Required arguments

--samFiles, -s

The two paired-end (PE) alignment SAM files to process.

--QCfolder

Path of folder to save the quality control data of the matrix. The log files produced this way can be loaded into hicQC in order to compare the quality of multiple Hi-C libraries.

--restrictionCutFile, -rs

BED file(s) with all restriction cut places (output of the “findRestSite” command). Should contain only mappable restriction sites. If given, the bins are set to match the restriction fragments (i.e. the region between one restriction site and the next). Alternatively, a fixed binSize can be defined instead. However, either binSize or restrictionCutFile must be defined. To use more than one restriction enzyme, generate a restrictionCutFile for each one and list them separated by spaces.

--restrictionSequence, -seq

Sequence of the restriction site. If multiple enzymes are used, list their sequences separated by spaces. If dangling sequences are listed at the same time, keep them in the same order.

--danglingSequence

Sequence left by the restriction enzyme after cutting. If multiple enzymes are used, list the sequences separated by spaces and preserve the order. Each restriction enzyme recognizes a different DNA sequence and, after cutting, leaves behind a specific “sticky” end or dangling-end sequence. For example, for HindIII the restriction site is AAGCTT and the dangling end is AGCT. For DpnII, the restriction site and dangling-end sequence are the same: GATC. This information is easily found in the description of the restriction enzyme. The dangling sequence is used to classify and report reads whose 5’ end starts with that sequence as dangling-end reads. A significant proportion of dangling-end reads in a sample indicates a problem with the re-ligation step of the protocol.
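The classification described above can be illustrated with a minimal Python sketch. It is not the HiCExplorer implementation: the function names and the example reads are hypothetical, and it assumes each read is given 5' to 3' as sequenced, so the handling of reverse-strand alignments (which SAM/BAM stores reverse-complemented) is deliberately left out.

```python
def is_dangling_end(read_sequence, dangling_sequences):
    """Return True if the read starts with any of the given dangling sequences."""
    read = read_sequence.upper()
    return any(read.startswith(seq.upper()) for seq in dangling_sequences)


def dangling_end_fraction(read_sequences, dangling_sequences):
    """Fraction of reads classified as dangling-end reads (0.0 for no reads)."""
    reads = list(read_sequences)
    if not reads:
        return 0.0
    hits = sum(is_dangling_end(read, dangling_sequences) for read in reads)
    return hits / len(reads)


# Hypothetical reads; AGCT and GATC are the dangling ends named above.
reads = ["AGCTTGGACCA", "TTGACCAGCTA", "agctAACCGGT", "GATCCATTGCA"]
print(dangling_end_fraction(reads, ["AGCT"]))          # 0.5
print(dangling_end_fraction(reads, ["AGCT", "GATC"]))  # 0.75
```

Note that the second read contains AGCT internally but is not counted, since only the start of the read is tested.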

Optional arguments

--lines

Number of lines to consider for the QC test run.

Default: 1000000

--version

show program’s version number and exit
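As a sketch of how the arguments above fit together, the following Python snippet assembles a hicQuickQC command line for two enzymes. All file and folder names are hypothetical placeholders; the flags are the ones listed in the usage above, and the enzyme sequences are the HindIII and DpnII examples from the --danglingSequence description. It assumes Python 3.8 or later for shlex.join.

```python
import shlex

# Hypothetical input and output paths; replace them with real files.
sam_files = ["mate_R1.sam", "mate_R2.sam"]
qc_folder = "quickqc_out"

# One entry per enzyme: (cut-site BED file, restriction sequence, dangling sequence).
# Keeping the three values in one tuple keeps the three argument lists in the same order.
enzymes = [
    ("hindiii_sites.bed", "AAGCTT", "AGCT"),
    ("dpnii_sites.bed", "GATC", "GATC"),
]

command = ["hicQuickQC", "--samFiles", *sam_files, "--QCfolder", qc_folder]
command += ["--restrictionCutFile", *(bed for bed, _, _ in enzymes)]
command += ["--restrictionSequence", *(site for _, site, _ in enzymes)]
command += ["--danglingSequence", *(dangling for _, _, dangling in enzymes)]
command += ["--lines", "1000000"]

# Print the shell command; pass `command` to subprocess.run to execute it
# on a system where hicQuickQC is installed.
print(shlex.join(command))
```

Building the three multi-value arguments from a single list of tuples is one way to avoid accidentally listing the dangling sequences in a different order from the restriction sequences.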

For more information, see ENCODE.