This repo includes the data processing pipeline of the HiCFoundation paper.
HiCFoundation is a generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species.
Copyright (C) 2024 Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, and Sheng Wang
License: Apache License 2.0
Contact: Sergei Doulatov ([email protected]) & William Stafford Noble ([email protected]) & Sheng Wang ([email protected])
For technical problems or questions, please reach out to Xiao Wang ([email protected]) and Yuanyuan Zhang ([email protected]).
Xiao Wang, Yuanyuan Zhang, Suhita Ray, Anupama Jha, Tangqi Fang, Shengqi Hang, Sergei Doulatov, William Stafford Noble, & Sheng Wang. A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species. bioRxiv, 2024. Paper
@article{wang2024hicfoundation,
title={A generalizable Hi-C foundation model for chromatin architecture, single-cell and multi-omics analysis across species},
author={Wang, Xiao and Zhang, Yuanyuan and Ray, Suhita and Jha, Anupama and Fang, Tangqi and Hang, Shengqi and Doulatov, Sergei and Noble, William Stafford and Wang, Sheng},
journal={bioRxiv},
year={2024}
}
Please follow the instructions of HiCFoundation repo to install HiCFoundation and configure its environment.
This section includes detailed steps for collecting data for HiCFoundation pre-training and fine-tuning.
Please follow the instructions in the notebook to download the .hic data needed for pre-training or other purposes.
To use the downloaded data for pre-training HiCFoundation, please see the Pre-training section.
Please follow the instructions in the notebook to download the files needed for the reproducibility task.
To convert those .bam/.pairs files to .hic files for further processing, please check the
4DN_pipeline.
To use the downloaded data for fine-tuning HiCFoundation, please see the Fine-tuning section.
Please follow the instructions in the notebook to download the files needed for the loop detection task.
To convert those .bam/.pairs files to .hic files for further processing, please check the
4DN_pipeline.
To use the downloaded data for fine-tuning HiCFoundation, please see the Fine-tuning section.
Please follow the instructions in the notebook to download the files needed for the resolution enhancement task.
To use the downloaded data for fine-tuning HiCFoundation, please see the Fine-tuning section.
Please follow the instructions in the notebook to download the files needed for the epigenomic profiling task.
To use the downloaded data for fine-tuning HiCFoundation, please see the Fine-tuning section.
Please follow the instructions in the notebook to download the files needed for the single-cell Hi-C resolution enhancement task.
To use the downloaded data for fine-tuning HiCFoundation, please see the Fine-tuning section.
Please follow the instructions in the notebook to download the files needed for multi-species analysis.
Then please run inference on the processed .pkl file following the instructions in the HiCFoundation repo.
The related Hi-C files can be downloaded from the GEO website.
For example, you can download the HSPC control Hi-C files from link.
Then please run inference on the .hic file following the instructions in the HiCFoundation repo.
Pre-training pipeline of HiCFoundation
Please check "1. All Hi-C experiments downloading" to download all Hi-C data for pre-training purposes.
We support Hi-C experiments recorded in the following formats. Please use the corresponding script under the utils
directory to convert them into .pkl files for further processing.
- .hic file: Please use the hic2array.py script to convert all cis and trans contacts to a .pkl file.
- .cool file: Please use the cool2array.py script to convert all cis and trans contacts to a .pkl file.
- .pairs file: Please use the pairs2array.py script to convert all contacts to a .pkl file.
Run instructions are included in each script. You can simply run the following command to get usage instructions for each conversion script:
python3 [script.py]
Then you can see detailed instructions in the command line.
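As a rough illustration of what these conversion scripts produce (the exact layout is defined by each script; the chromosome-keyed dict shown here is an assumption), the output can be thought of as sparse contact matrices serialized with pickle:

```python
import pickle

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical layout: one sparse contact matrix per chromosome.
contacts = {
    "chr1": coo_matrix(np.array([[5, 0], [0, 3]])),
    "chr2": coo_matrix(np.array([[2, 1], [1, 0]])),
}

blob = pickle.dumps(contacts)   # the scripts write this to a .pkl file
loaded = pickle.loads(blob)

print(sorted(loaded))           # ['chr1', 'chr2']
print(loaded["chr1"].sum())     # 8
```

Storing contacts as scipy.sparse matrices keeps memory manageable, since Hi-C matrices are overwhelmingly zero away from the diagonal.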
We used 5kb resolution for pre-training to include more data for training.
Please run the following command to generate submatrices from the .pkl file:
python3 utils/scan_array.py --input_pkl_path [pkl_path] --input_row_size 448 \
--input_col_size 448 --stride_row 224 --stride_col 224 \
--output_dir [output_dir] --filter_threshold 0.01
- pkl_path: str, input pickle path
- output_dir: str, output directory
- filter_threshold: float, the threshold to filter out submatrices with too many zeros. Here we filtered out submatrices in which fewer than 1% of entries had nonzero reads.
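The filtering criterion can be sketched as follows: keep a submatrix only if at least `filter_threshold` of its entries are nonzero (a simplified stand-in for the logic inside `scan_array.py`, not its exact implementation):

```python
import numpy as np

def keep_submatrix(sub: np.ndarray, filter_threshold: float = 0.01) -> bool:
    """Keep a submatrix only if the fraction of nonzero entries
    reaches filter_threshold."""
    return np.count_nonzero(sub) / sub.size >= filter_threshold

nearly_empty = np.eye(448)                   # 448 / 448^2 ~ 0.22% nonzero
print(keep_submatrix(nearly_empty))          # False
print(keep_submatrix(np.ones((448, 448))))   # True
```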
The submatrices generated from each .pkl file should be placed under output_dir/[hic_id],
so that they can be easily processed by the pre-training framework in the HiCFoundation repo.
After preparing the data, please follow the pre-training framework instructions on HiCFoundation.
Then you can train HiCFoundation from scratch.
Fine-tuning pipeline of HiCFoundation
Please follow the instructions to download data for different tasks of HiCFoundation.
- Reproducibility task.
- Chromatin loop detection task.
- Resolution enhancement task.
- Epigenomic assay profiling task.
- Single-cell Hi-C analysis.
They are covered in different subsections of the dataset collection section. Please check the corresponding subsection for more details.
The submatrices should be saved in .pkl format for HiCFoundation fine-tuning framework processing.
"input": the input Hi-C/scHi-C matrix in scipy.sparse or numpy.array format, shape: (M,N);
"input_count": the total count of Hi-C expriment, should be a float scalar value; (optional)
"2d_target": the output Hi-C/scHi-C matrix in scipy.sparse or numpy.array format, shape: (M,N); (optional)
"embed_target": the embedding 1D vector in numpy.array format, shape: (512); (optional)
"1d_target": the 1D target vector in numpy.array format, shape: (M); (optional)
First, convert all .hic files to .pkl files at the specified resolution.
python3 utils/hic2array.py {input_hic} {output_pkl} {resolution} 0 2
- {input_hic} is the input Hi-C file path
- {output_pkl} is the converted .pkl file path
- {resolution} is the resolution for analysis (integer): 25000 (25 kb) for the reproducibility task, 10000 (10 kb) for the resolution enhancement/loop detection tasks, 1000 (1 kb) for the epigenomic assay task, and 1000000 (1 Mb) for single-cell analysis
- 0 indicates no normalization is applied; 2 indicates saving cis contacts in scipy.sparse coo_matrix format.
For reproducibility/loop/resolution/single-cell task, please run the following command line to generate submatrices.
python3 utils/scan_array_diag.py --input_pkl_path [pkl_path] --input_row_size 224 \
--input_col_size 224 --stride 20 \
--output_dir [output_dir]
This script will generate many submatrices for fine-tuning of different tasks.
- [pkl_path]: The processed pkl file generated from last step.
- [output_dir]: The output directory
For epigenomic assay profiling task, please run the following command line to generate submatrices.
python3 utils/scan_array_diag_center.py --input_pkl_path [pkl_path] --input_row_size 128 \
--input_col_size 4000 --stride 32 \
--output_dir [output_dir]
Here we need to make sure that, in each training sample, the center of the submatrix's columns corresponds to the center of the diagonal.
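The centering constraint can be illustrated with simple index arithmetic. For a 128-row block whose diagonal center sits at genomic bin `diag_center`, the 4000-column window is placed so that bin falls at its midpoint (a sketch under these assumptions, not the script's exact implementation):

```python
def centered_window(diag_center, n_bins, row_size=128, col_size=4000):
    """Row/column ranges of a submatrix whose column midpoint lies
    on the diagonal at bin diag_center, clipped to the matrix bounds."""
    row_start = min(max(0, diag_center - row_size // 2), n_bins - row_size)
    col_start = min(max(0, diag_center - col_size // 2), n_bins - col_size)
    return (row_start, row_start + row_size), (col_start, col_start + col_size)

rows, cols = centered_window(diag_center=10_000, n_bins=50_000)
print(rows, cols)  # (9936, 10064) (8000, 12000)
```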
No further labels are needed. Based on the Supplementary Table Sup3 sheet, embeddings of submatrices from biological replicates (BR) should be similar, while embeddings from non-replicates (NR) should be different.
Then please integrate triplet loss in loss_function in HiCFoundation repo.
import torch.nn as nn
import torch.nn.functional as F
criterion = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda x, y: 1.0 - F.cosine_similarity(x, y),
    margin=1.0)
loss = criterion(anchor, positive, negative)
where the anchor, positive, negative are the embeddings of BR1, BR2 and NR, respectively.
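To sanity-check the loss on toy embeddings without PyTorch, the same cosine-distance triplet margin loss, loss = max(0, d(a, p) - d(a, n) + margin) with d(x, y) = 1 - cos(x, y), can be reproduced in plain NumPy:

```python
import numpy as np

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def triplet_cosine_loss(anchor, positive, negative, margin=1.0):
    # max(0, d(a, p) - d(a, n) + margin)
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)

a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])   # identical to anchor: distance 0
n = np.array([-1.0, 0.0])  # opposite direction: distance 2
print(triplet_cosine_loss(a, p, n))  # 0 - 2 + 1 = -1 -> clamped to 0.0
```

A zero loss means the positive is already closer to the anchor than the negative by at least the margin, which is exactly the behavior wanted for BR versus NR embeddings.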
Please first run HiCCUPS to call loops on each BR separately, taking the processed .hic file as input.
python3 utils/hiccups_loop.py [hicFile] [output_dir] [resolution]
- [hicFile] is the processed .hic file from the .bam/.pairs file (illustrated in the dataset preparation section)
- [output_dir] is the output directory to store the detected loops in .bedpe format
- [resolution] is the resolution to run at; can be 5000 (5kb), 10000 (10kb), or 25000 (25kb).
Then you can merge the loop calls from two BRs, and use the consensus loop to train the model.
To merge the loop calls, please run the following command
python3 utils/merge_BRloop.py [BR1_loop.bedpe] [BR2_loop.bedpe] [resolution] [output.bedpe]
- [BR1_loop.bedpe] is the HiCCUPS loop call from BR1.hic, stored as merged_loops.bedpe in your specified directory.
- [BR2_loop.bedpe] is the HiCCUPS loop call from BR2.hic, stored as merged_loops.bedpe in your specified directory.
- [resolution] is the resolution of the loop calls; can be 5000 (5kb), 10000 (10kb), or 25000 (25kb).
- [output.bedpe] is the specified path to store the consensus loops.
You can then use [output.bedpe] to modify each submatrix's "2d_target" key in the .pkl file.
In our setting, we assigned the neighboring 5×5 pixels (50 kb×50 kb at 10 kb resolution) around each loop call as pixel-wise loop labels.
You can modify the assign_label function to assign pixel-level loop target for model's training.
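As a sketch of what such an assign_label function could look like (the actual function lives in the repo; this version simply paints a 5×5 neighborhood around each loop-anchor pixel):

```python
import numpy as np

def assign_label(shape, loop_pixels, half_window=2):
    """Pixel-wise loop labels: mark a (2*half_window+1)^2 neighborhood
    around each loop pixel, clipped to the matrix bounds."""
    labels = np.zeros(shape, dtype=np.float32)
    n_rows, n_cols = shape
    for i, j in loop_pixels:
        r0, r1 = max(0, i - half_window), min(n_rows, i + half_window + 1)
        c0, c1 = max(0, j - half_window), min(n_cols, j + half_window + 1)
        labels[r0:r1, c0:c1] = 1.0
    return labels

labels = assign_label((224, 224), [(100, 120)])
print(int(labels.sum()))  # 25: a full 5x5 neighborhood marked positive
```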
Here the input should be the downsampled submatrix, and the output (target) should be the original submatrix.
The first step is to get the downsampled pair of a Hi-C experiment. You can do so with the following command:
python3 utils/downsample_pkl.py [input.pkl] [downsample.pkl] [downsample_ratio]
- [input.pkl] the input pickle that includes all Hi-C information, which is processed in dataset collection section by converting .hic/.cool/.pairs data.
- [downsample.pkl] the output pickle that includes the downsampled Hi-C information.
- [downsample_ratio] the downsample ratio applied to the Hi-C.
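Conceptually, downsampling keeps each read with probability equal to the downsample ratio, which can be sketched with binomial sampling on the contact counts (a simplified stand-in for `downsample_pkl.py`, not its exact implementation):

```python
import numpy as np
from scipy.sparse import coo_matrix

def downsample_coo(mat, ratio, seed=0):
    """Keep each read independently with probability `ratio`."""
    rng = np.random.default_rng(seed)
    kept = rng.binomial(mat.data.astype(np.int64), ratio)
    mask = kept > 0  # drop entries whose counts fell to zero
    return coo_matrix((kept[mask], (mat.row[mask], mat.col[mask])),
                      shape=mat.shape)

full = coo_matrix(np.full((4, 4), 100))
down = downsample_coo(full, ratio=0.25)
print(int(full.sum()), int(down.sum()))  # 1600 and roughly 400
```

Binomial thinning preserves the Poisson-like statistics of Hi-C counts, so the downsampled matrix behaves like a genuinely lower-coverage experiment.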
Then you can modify the scan_array_diag.py script to scan across [input.pkl] and [downsample.pkl] in parallel; the submatrices from [input.pkl] should be saved under the "2d_target" key, and the submatrices from [downsample.pkl] under the "input" key.
After collecting the .bigWig files of different epigenomic assays, please first convert them into .pkl files for further processing.
python3 utils/bigwig2array.py [input_bw] [output_pkl] [resolution]
[input_bw]: the input bigwig file.
[output_pkl]: the output pkl file with [chrom]:[signal] format.
[resolution]: the output resolution of the signal.
Here our resolution for epigenomic assay is 1000 (1kb).
After obtaining the epigenomic .pkl file, please update the "1d_target" key in the submatrix's pkl file. You can simply update the script to also get the corresponding 1D signal from the epigenomic .pkl file and save it to the "1d_target" key in the submatrix .pkl file.
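Extracting the matching 1D signal amounts to slicing the binned track by the submatrix's genomic bin range. A minimal sketch, assuming the epigenomic .pkl stores a dict of per-chromosome 1D numpy signals binned at the same resolution as the submatrix:

```python
import numpy as np

def get_1d_target(epi_signal, chrom, bin_start, n_bins):
    """Slice the binned epigenomic signal matching a submatrix's bin range."""
    return epi_signal[chrom][bin_start:bin_start + n_bins]

epi = {"chr1": np.arange(1000, dtype=np.float32)}  # toy 1 kb-binned track
target = get_1d_target(epi, "chr1", bin_start=100, n_bins=128)
print(target.shape)  # (128,)
```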
Using the prepared submatrix .pkl files, please follow the instructions on Fine-tuning framework of HiCFoundation on HiCFoundation repo to start fine-tuning HiCFoundation for specific tasks.
Inference pipeline of HiCFoundation
For inference of HiCFoundation on different tasks, please see the instructions in [inference of HiCFoundation](https://github.com/Noble-Lab/HiCFoundation/tree/main?tab=readme-ov-file#inference-of-fine-tuned-hicfoundation) on the HiCFoundation repo. Recommended: [Google Colab](https://github.com/Noble-Lab/HiCFoundation/blob/main/HiCFoundation.ipynb). Please consider using Google Colab for online inference if you only want to test a few examples; the environment is configured automatically.