This repository provides tools for calculating the number of DJ counts in the genome aligned to the GRCh38 (broad reference).
The DJ counts are determined by assessing the coverage of specific regions across the GRCh38 reference genome.
- Calculating the background coverage
- Calculating the coverage of DJ regions scattered across GRCh38 reference
- Calculating the number of DJ counts
BED files are provided in this repository (link here). Aligned BAM files should be based on the Broad GRCh38 reference. You can download the reference from Broad Github.
# This is an example
sample=Sample01
threads=10
bam="/data/01.broad_hg38/$sample/$sample.dedup.bam" # BAM or CRAM file
outdir="/data/01.broad_hg38/$sample" # The output directory
prefix="$sample" # Prefix for output files
bed="/data/01.broad_hg38/uk.dj.bed" # BED file for DJ counting
We calculate background coverage using autosomal chromosomes from the GRCh38 reference. This is achieved by analyzing aligned BAM files with tools like samtools depth or samtools idxstats. Ensure that your BAM files contain the necessary contigs for accurate DJ coverage calculation. Background coverage is computed as the median depth across autosomal chromosomes.
## Calculate the read length
fragmentSize=$(samtools view $bam | head -10000 | awk '{print length($10)}' | sort -n | awk '{a[i++]=$1} END {print a[int(i/2)];}')
echo $fragmentSize
## Calculate the total background
bgCov=$(for i in {1..22}; do
samtools coverage -r chr${i} "$bam" | sed -n 2p | awk -v frag="$fragmentSize" '{print $4/$3*frag}'
done | sort -n | awk '{a[NR]=$1} END {print a[int(NR/2)]};')
echo $bgCov
The provided BED file contains regions highly similar to DJ regions on CHM13 chromosome 13, excluding high variable or repeat regions.
Coverage calculation utilizes samtools depth
, with computational time typically under 5~8 minutes using 10 threads, depending on BAM file size.
# Calculate the DJ regions' coverage
sum=$(samtools depth -@ $threads -b $bed $bam | awk 'BEGIN { SUM=0 } { SUM+=$3 } END { print SUM }')
echo $sum
DJ counts are derived by dividing the total depth on DJ regions in GRCh38 by the background coverage. The total length of DJ regions used in this analysis is fixed at 136,405 bp. Results are presented in diploid genome bases by multiplying by 2.
covLen=136405
djCount=$(echo "scale=5; 2 * $sum / $covLen / $bgCov" | bc)
echo -e "$prefix\t$fragmentSize\t$bgCov\t$sum\t$djCount" > $outdir/$prefix.dj_hg38.txt
The output file $outdir/$prefix.dj_hg38.txt
contains two columns separated by tabs:
- $prefix: This column contains identifiers or names associated with each estimation of DJ count.
- Estimation of DJ count based on diploid genome: This column provides the calculated DJ count values adjusted for diploid genome context.
Sample01 8.30800
Normal human samples typically yield around 10 copies of DJ counts, with occasional deviations to ~11 or ~9. Robertsonian samples usually show approximately ~8 copies.
V0.1(2024-07-17)
* first commitV0.2(2024-07-25)
* Changing the background coverage estimation methods from samtools idxstats to samtools coverage.* Removing the step of saving temporary files; instead, we assign everything to variables.
V0.2.1(2024-07-29)
* Add background and fragment size to the output file.* Fix the command line used for calculating the background to ensure it works correctly.