HiCExplorer file formats

HiCExplorer has a native interaction file format, h5, and several native formats for capture Hi-C analysis.

Interaction matrix format h5

The h5 format is implemented as a hdf5 container and can be accessed either via HiCMatrix, the hdf5 API or a graphical user interface like hdfview. The data in the hdf5 format is read and written via the PyTables interface.

The format is structured as follows:

- intervals
   - chr_list (CArray)
   - end_list (CArray)
   - extra_list (CArray)
   - start_list (CArray)
- matrix
   - data (CArray)
   - indices (CArray)
   - indptr (CArray)
   - shape (tupel)
 - distance counts (CArray) [optional]
 - nan_bins (CArray) [optional]
 - correction_factors (CArray) [optional]

The group ‘matrix’ contains the data which is necessary to create a scipy sparse csr matrix:

import tables
with tables.open_file(self.matrixFileName, 'r') as f:
        parts = {}
        try:
            for matrix_part in ('data', 'indices', 'indptr', 'shape'):
                parts[matrix_part] = getattr(f.root.matrix, matrix_part).read()
        except Exception as e:
            log.info('No h5 file. Please check parameters concerning the file type!')
            e
        matrix = csr_matrix(tuple([parts['data'], parts['indices'], parts['indptr']]),
                            shape=parts['shape'])

The group ‘intervals’ contains the information of the genomic regions associated to a bin in the matrix via its index. To retrieve the data run:

with tables.open_file(self.matrixFileName, 'r') as f:
    intvals = {}
    for interval_part in ('chr_list', 'start_list', 'end_list', 'extra_list'):
        if toString(interval_part) == toString('chr_list'):
            chrom_list = getattr(f.root.intervals, interval_part).read()
            intvals[interval_part] = toString(chrom_list)
        else:
            intvals[interval_part] = getattr(f.root.intervals, interval_part).read()

    cut_intervals = list(zip(intvals['chr_list'], intvals['start_list'], intvals['end_list'], intvals['extra_list']))

To write the format, the above structure needs to be created via PyTables. For example:

with tables.open_file(filename, mode="w", title="HiCExplorer matrix") as h5file:
    matrix_group = h5file.create_group("/", "matrix", )
    for matrix_part in ('data', 'indices', 'indptr', 'shape'):
            arr = np.array(getattr(matrix, matrix_part))
            atom = tables.Atom.from_dtype(arr.dtype)
            ds = h5file.create_carray(matrix_group, matrix_part, atom,
                                      shape=arr.shape,
                                      filters=filters)

The matrix object is a scipy csr_matrix.

Please refer to HiCMatrix for a reference implementation: https://github.com/deeptools/HiCMatrix/blob/master/hicmatrix/lib/h5.py

To open a cool / h5 file with the HiCMatrix library, use the following code:

import hicmatrix.HiCMatrix as hm

hic_ma = hm.hiCMatrix("/path/to/matrix")

# csr_matrix
hic_ma.matrix

# the corresponding genomic regions as a list
hic_ma.cut_intervals

To write a h5 use the following code. Please consider, once the datatype of a matrix is specified, for example by reading a h5 matrix or by first time writing a h5 matrix, it cannot be changed to a cool file anymore (except by manipulating the internal matrixFilehandler object).

import hicmatrix.HiCMatrix as hm
from scipy.sparse import csr_matrix


# create a HiCMatrix object
hic_ma = hm.hiCMatrix()

# set important data structures

hic_ma.matrix = csr_matrix()
hic_ma.cut_intervals = list[(chr1, start1, end1, 1), (chr1, start2, end2, 1), ..., (chr1, startN, endN, 1)]

# to store a h5 matrix
hic_ma.save('/path/to/storage/matrix.h5')
# to store a cool matrix
hic_ma.save('/path/to/storage/matrix.cool')

Alternatively, the matrixFileHandler object can be accessed directly:

from hicmatrix.lib import MatrixFileHandler


# Load a matrix via the MatrixFileHandler class

matrixFileHandlerInput = MatrixFileHandler(pFileType=args.inputFormat, pMatrixFile=matrix,
                                                        pCorrectionFactorTable=args.correction_name,
                                                        pCorrectionOperator=correction_operator,
                                                        pChrnameList=chromosomes_to_load,
                                                        pEnforceInteger=args.enforce_integer,
                                                        pApplyCorrectionCoolerLoad=applyCorrectionCoolerLoad)

# Load data
_matrix, cut_intervals, nan_bins, \
   distance_counts, correction_factors = matrixFileHandlerInput.load()

# create matrixFileHandler output object
matrixFileHandlerOutput = MatrixFileHandler(pFileType=args.outputFormat, pEnforceInteger=args.enforce_integer, pFileWasH5=format_was_h5, pHic2CoolVersion=hic2CoolVersion, pHiCInfo=cool_metadata)

# set the variables
matrixFileHandlerOutput.set_matrix_variables(_matrix, cut_intervals, nan_bins,
                                                          correction_factors, distance_counts)

matrixFileHandlerOutput.save(args.outFileName, pSymmetric=True, pApplyCorrection=applyCorrection)

Capture Hi-C HDF containers

The capture Hi-C data analysis creates for the scripts chicViewpoint, chicSignificantInteractions, chicAggregateStatistic and chicDifferentialTest individual HDF containers to store the processed data.

chicViewpoint

- averageContactBin (int)
- fixateRange (int)
- range (int, int)
- resolution (int)
- type='interactions' (string)
- matrix 1
 - chromosome 1
    - gene name 1
        - chromosome (String)
        - end_list (array)
        - gene (String)
        - interaction_data_list (array)
        - pvalue (array)
        - raw (array)
        - reference_point_end (int)
        - reference_point_start (int)
        - relative_position_list (array)
        - start_list (array)
        - sum_of_interactions (float)
        - xfold (array)
    - gene name 2
        - ...
        - ...
     - ...
 - chromosome 2
    - gene 1
       - ...
       - ...
    - ...
 - ...
 - ...
 - genes
   - gene name 1 (link to matrix 1 / chromosome 1 / gene name 1
   - ...
- matrix 1
  - chromosome 1
     - gene name 1
         - chromosome (String)
         - end_list (array)
         - gene (String)
         - interaction_data_list (array)
         - pvalue (array)
         - raw (array)
         - reference_point_end (int)
         - reference_point_start (int)
         - relative_position_list (array)
         - start_list (array)
         - sum_of_interactions (float)
         - xfold (array)
     - gene name 2
         - ...
         - ...
     - ...
  - chromosome 2
     - gene 1
        - ...
        - ...
  - ...
  - ...

 - genes
   - gene name 1 (link to matrix 2 / chromosome 1 / gene name 1
   - ...

chicSignificantInteractions

chicSignificantInteractions creates two files: a target file and a file containing the significant interactions:

Depending on the combination mode (single / dual) the structure is slightly different for the target file:

- combinationMode = 'single' (String)
- fixateRange (int)
- mode_preselection (String)
- mode_preselection_value (float)
- peakInteractionsThreshold (float)
- pvalue (float)
- range (int, int)
- truncateZeroPvalues (Boolean)
- type='target' (String)
- matrix 1
 - chromosome 1
    - gene name 1
        - chromosome (String)
        - end_list (array)
        - reference_point_end (int)
        - reference_point_start (int)
        - start_list (array)
    - gene name 2
        - ...
        - ...
     - ...
 - chromosome 2
    - gene 1
       - ...
       - ...
 - ...
 - ...
 - genes
   - gene name 1 (link to matrix 1 / chromosome 1 / gene name 1
   - ...
- matrix 1
  - chromosome 1
     - gene name 1
        - chromosome (String)
        - end_list (array)
        - reference_point_end (int)
        - reference_point_start (int)
        - start_list (array)
     - gene name 2
         - ...
         - ...
      - ...
  - chromosome 2
     - gene 1
        - ...
        - ...
  - ...
  - ...

 - genes
   - gene name 1 (link to matrix 2 / chromosome 1 / gene name 1
   - ...

Combination mode ‘dual’ combines the target regions of one viewpoint region of two matrices.

- combinationMode = 'dual' (String)
- fixateRange (int)
- mode_preselection (String)
- mode_preselection_value (float)
- peakInteractionsThreshold (float)
- pvalue (float)
- range (int, int)
- truncateZeroPvalues (Boolean)
- type='target' (String)
- matrix 1
   - matrix 2
      - chromosome 1
         - gene name 1
            - chromosome (String)
            - end_list (array)
            - reference_point_end (int)
            - reference_point_start (int)
            - start_list (array)
         - gene name 2
            - ...
            - ...
         - ...
      - chromosome 2
         - gene 1
            - ...
            - ...
      - genes
         - gene name 1 (link to matrix 1 / matrix 2 / chromosome 1 / gene name 1
         - ...
      - matrix 3
         - chromosome 1
            - gene name 1
               - ...
         - ...
         - genes
            - gene name 1 (link to matrix 1 / matrix 3 / chromosome 1 / gene name 1
            - ...
- matrix 2
   - matrix 3
      - chromosome 1
         - gene name 1
            - ...
         - ...
      - ...
   - ...
      - genes
         - gene name 1 (link to matrix 2 / matrix 3 / chromosome 1 / gene name 1
         - ...
- ...

Significant interactions file in the ‘single’ and ‘dual’ mode don’t have a difference in their structure:

- combinationMode  = 'single' (String) / 'dual' (String)
- fixateRange (int)
- mode_preselection (String)
- mode_preselection_value (float)
- peakInteractionsThreshold (float)
- pvalue (float)
- range (int, int)
- truncateZeroPvalues (Boolean)
- type='significant' (String)
- matrix 1
 - chromosome 1
    - gene name 1
         - chromosome (String)
         - end_list (array)
         - gene (String)
         - interaction_data_list (array)
         - pvalue (array)
         - raw (array)
         - reference_point_end (int)
         - reference_point_start (int)
         - relative_position_list (array)
         - start_list (array)
         - sum_of_interactions (float)
         - xfold (array)
    - gene name 2
        - ...
        - ...
     . ---
 - chromosome 2
    - gene 1
       - ...
       - ...
 - ...
 - ...
 - genes
   - gene name 1 (link to matrix 1 / chromosome 1 / gene name 1
   - ...
- matrix 1
  - chromosome 1
     - gene name 1
         - chromosome (String)
         - end_list (array)
         - gene (String)
         - interaction_data_list (array)
         - pvalue (array)
         - raw (array)
         - reference_point_end (int)
         - reference_point_start (int)
         - relative_position_list (array)
         - start_list (array)
         - sum_of_interactions (float)
         - xfold (array)
     - gene name 2
         - ...
         - ...
      . ---
  - chromosome 2
     - gene 1
        - ...
        - ...
  - ...
  - ...

 - genes
   - gene name 1 (link to matrix 2 / chromosome 1 / gene name 1
   - ...

chicAggregateStatistic

- type='aggregate' (String)
- matrix_1_matrix_2
   - matrix 1
      - chromosome 1
         - gene name 1
               - chromosome (String)
               - end_list (array)
               - gene_name (String)
               - interaction_data_list (array)
               - raw_target_list (array)
               - relative_distance_list (array)
               - start_list (array)
               - sum_of_interactions (float)
         - gene name 2
        - ...
        - ...
      - genes
         - gene name 1 (link to matrix_1_matrix_2 / matrix 1 / chromosome 1 / gene name 1
         - ...
   - matrix 2
      - chromosome 1
            - gene name 1
                  - chromosome (String)
                  - end_list (array)
                  - gene_name (String)
                  - interaction_data_list (array)
                  - raw_target_list (array)
                  - relative_distance_list (array)
                  - start_list (array)
                  - sum_of_interactions (float)
            - gene name 2
         - ...
         - ...
      - genes
         - gene name 1 (link to matrix_1_matrix_2/ matrix 2 / chromosome 1 / gene name 1
         - ...
- matrix_2_matrix_2
   - matrix 2
      - ...
   - matrix 3
      - ...

chicDifferentialTest

- type='differential' (String)
- alpha (float)
- test: fisher / chi2 (String)
- matrix_1
   - matrix 2
      - chromosome 1
         - gene name 1
               - accepted
                  - chromosome (String)
                  - end_list (array)
                  - gene (String)
                  - interaction_data_list (array)
                  - pvalue (array)
                  - raw_target_list_1 (array)
                  - raw_target_list_2 (array)
                  - relative_distance_list (array)
                  - start_list (array)
                  - sum_of_interactions_1 (float)
                  - sum_of_interactions_2 (float)
               - all
                  - chromosome (String)
                  - end_list (array)
                  - gene (String)
                  - interaction_data_list (array)
                  - pvalue (array)
                  - raw_target_list_1 (array)
                  - raw_target_list_2 (array)
                  - relative_distance_list (array)
                  - start_list (array)
                  - sum_of_interactions_1 (float)
                  - sum_of_interactions_2 (float)
               - rejected
                  - chromosome (String)
                  - end_list (array)
                  - gene (String)
                  - interaction_data_list (array)
                  - pvalue (array)
                  - raw_target_list_1 (array)
                  - raw_target_list_2 (array)
                  - relative_distance_list (array)
                  - start_list (array)
                  - sum_of_interactions_1 (float)
                  - sum_of_interactions_2 (float)
         - gene name 2
               - accepted
               - all
               - rejected
         - ...
      - genes
         - gene name 1 (link to matrix_1 / matrix 2 / chromosome 1 / gene name 1
         - ...
   - matrix 3
      - chromosome 1
            - gene name 1
                  - accepted
                  - all
                  - rejected
            - gene name 2
         - ...
         - ...
      - genes
         - gene name 1 (link to matrix_1 / matrix 3 / chromosome 1 / gene name 1
         - ...
   - ...
- matrix 2
   - matrix 3
      - ...