Recent research indicates that analyzing the fragmentation patterns of cell-free DNA (cfDNA) from genome sequencing data can provide insights into nucleosome occupancy in the cells of origin. In accessible genomic regions, nucleosomes are arranged systematically to facilitate access for DNA-binding proteins.
This pipeline calculates the composite coverage over specific genome regions.
BAM input files are filtered by fragment size (by default, cfDNA fragments between 90 and 150 bp are retained) and corrected for GC bias using the method proposed by [Benjamini & Speed (2012). Nucleic Acids Research, 40(10)].
The filtered BAM is then converted into a coverage file (bigWig). At this stage, a blacklist of genomic regions can be applied; we blacklist the problematic regions of the genome defined by ENCODE. [Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z]
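The pipeline performs these steps with deepTools. As an illustration only, the core idea of the Benjamini & Speed correction (reweighting fragments by expected vs. observed counts per GC class) can be sketched as follows; the function names are hypothetical and this is not the pipeline's implementation:

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G/C bases in a fragment sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_weights(observed_gc, expected_gc, bins=10):
    """Per-GC-bin correction weight = expected / observed fragment counts.

    observed_gc: GC fractions of the sequenced fragments.
    expected_gc: GC fractions of fragments drawn uniformly from the genome.
    """
    to_bin = lambda g: min(int(g * bins), bins - 1)
    obs = Counter(to_bin(g) for g in observed_gc)
    exp = Counter(to_bin(g) for g in expected_gc)
    # A fragment's weight is > 1 where its GC class is under-represented.
    return {b: exp[b] / obs[b] for b in obs if exp[b]}
```

Fragments would then be counted with their weight instead of 1 when building the coverage track.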
With computeMatrix, we calculate the coverage of the signal (bigWig file) over a set of targets (BED file); in our analysis, this is typically a set of transcription factor binding site (TFBS) regions. We use the reference-point mode: the signal distribution is computed relative to a point (the TFBS center in our case), expanding 4 kb on both sides.
A first plot of the composite coverage on the TFBS target set (raw coverage) is generated by the process plotHeatmap.
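Conceptually, the composite coverage is an average of the binned signal over all target windows. A minimal sketch of that idea, using plain Python lists in place of the pipeline's bigWig input (the helper is hypothetical, not part of the pipeline):

```python
def composite_coverage(coverage, centers, flank=4000, bin_size=10):
    """Mean binned coverage in [center - flank, center + flank) across sites.

    coverage: per-base signal for one chromosome (list of floats).
    centers:  TFBS midpoint positions on that chromosome.
    Sites whose window falls outside the signal are skipped.
    """
    n_bins = (2 * flank) // bin_size
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for center in centers:
        start = center - flank
        if start < 0 or start + 2 * flank > len(coverage):
            continue
        for b in range(n_bins):
            lo = start + b * bin_size
            window = coverage[lo:lo + bin_size]
            sums[b] += sum(window) / bin_size
            counts[b] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]
```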
The process PEAK_STATS generates 3 output files:

- `peak_data.tsv`: contains, for each bin, the composite coverage (`raw`), the bin position (`bin`), the coverage relative to the background median (`relative`), and the background median itself (`background_median`).
- `peak_stats.tsv`: summary statistics for the composite coverage: signal, target, source, integration (Monte Carlo integration), length of the peak (`length`), relative length of the peak (`rlength`), min and max raw values (`ymin` and `ymax`), and the ratio between length and integration (`ratio`).
- `PeakIntegration.pdf`: PDF plot of the calculated metrics.
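The statistics themselves are produced by the pipeline's R script. The `relative` normalization can be sketched as follows, assuming (as in the test invocation shown at the end of this document) a background of 50 bins on each side of the window; the function name is hypothetical:

```python
import statistics

def peak_relative(raw, background_bins=50):
    """Normalize a composite profile by its background median.

    The background is taken from the outermost `background_bins` bins on
    each side of the window (cf. --background-left-limit and
    --background-right-limit in fragmentomics_peakStats.R).
    """
    background = raw[:background_bins] + raw[len(raw) - background_bins:]
    bg_median = statistics.median(background)
    return {
        "background_median": bg_median,
        "relative": [r / bg_median for r in raw],
        "ymin": min(raw),
        "ymax": max(raw),
    }
```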
```mermaid
flowchart TB
  subgraph " "
    subgraph params
      v4["input"]
      v2["blacklist_bed"]
      v0["genome_2bit"]
      v6["targets"]
    end
    v9([FRAGMENTOMICS])
    v0 --> v9
    v2 --> v9
    v4 --> v9
    v6 --> v9
  end
```
Create a `samplesheet.csv`:

```csv
caseid,sampleid,timepoint,bam,bai,bw
PATIENT_A,PATIENT_A_T1,T1,/path/to/bam.bam,/path/to/bam.bai,
PATIENT_A,PATIENT_A_T2,T2,/path/to/bam.bam,/path/to/bam.bai,
```
Where:

- `caseid` is the patient
- `sampleid` is the sample
- `timepoint` is the time group
- `bam` is the input BAM file
- `bai` is the BAM index
- `bw` is the (optional) bigWig file (if provided, preprocessing will be skipped)
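For a quick sanity check of a samplesheet outside the pipeline, the columns can be validated with a short script (a sketch, not part of the pipeline; names are hypothetical):

```python
import csv
import io

REQUIRED = ("caseid", "sampleid", "timepoint", "bam", "bai", "bw")

def parse_samplesheet(text):
    """Parse samplesheet.csv; an empty `bw` means preprocessing will run."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        missing = [col for col in REQUIRED if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        row["skip_preprocess"] = bool(row["bw"])
    return rows
```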
Create a target list (`targets.csv`):

```csv
name,source,bed
MYC,GRIFFIN,./tests/input/stub/myc.bed
ELK4,TIMON,./tests/input/stub/elk4.bed
rand1,house_keeping_dataset,./tests/input/stub/rand1.bed
rand2,house_keeping_dataset,./tests/input/stub/rand2.bed
rand3,house_keeping_dataset,./tests/input/stub/rand3.bed
HouseKeeping,house_keeping_dataset,./tests/input/stub/GeneHancer_housekeeping.bed
```
Here we are defining 6 targets:

- MYC transcription factor binding sites (GRIFFIN source) and ELK4 binding sites (TIMON source).
- 3 sets of randomly picked genes (rand1-rand3), used for comparison against the housekeeping set.
- A set of housekeeping genes from GeneHancer, which should always be active.
Required parameters:

```yaml
input: "samplesheet.csv"
targets: "targets.csv"
outdir: "./results"
```

- `genome_2bit`: genome in two-bit format. Most genomes can be found here: http://hgdownload.cse.ucsc.edu/gbdb/
- `blacklist_bed`: BED file with blacklisted regions, used in COVERAGEBAM (bigWig generation) and in COMPUTEMATRIX (matrix calculation). We are using the ENCODE blacklist from: Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z
- `genome_size`: effective genome size used by the GC correction functions (see also the deepTools effective genome size page)
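Putting the required parameters together, a `params.yaml` might look like this (the paths are placeholders; the `genome_size` shown is the deepTools effective genome size for GRCh38):

```yaml
input: "samplesheet.csv"
targets: "targets.csv"
outdir: "./results"
genome_2bit: "/path/to/hg38.2bit"
blacklist_bed: "/path/to/encode-blacklist.bed"
genome_size: 2913022398
```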
Available profiles (see also conf/profiles.config
):
stub
: run with stub truedebug
: run with debug truedevel
: run locallyslurm
: for our HPC
Some profile examples:

Stub run with stub annotation files:

```bash
conda activate nf-fragmentomics-env
nextflow run main.nf -params-file examples/input/example_params.yaml -profile stub -stub-run
```

Run on a single machine with the conda environment:

```bash
nextflow run main.nf -profile devel,conda -params-file params.yaml
```

Run on a SLURM cluster with Singularity:

```bash
nextflow run main.nf -profile slurm,singularity -params-file params.yaml
```
```mermaid
flowchart TB
  subgraph BAM_PREPROCESS
    subgraph take
      v0["bam_ch"]
      v2["blacklist_bed"]
      v1["genome_2bit"]
    end
    v3([BAMPEFRAGMENTSIZE])
    v4([PLOTCOVERAGE])
    v5([FILTERBAMBYSIZE])
    v7([COMPUTEGCBIAS])
    v8([CORRECTGCBIAS])
    v10([COVERAGEBAM])
    subgraph emit
      v11["wiggle_ch"]
    end
    v0 --> v3
    v0 --> v4
    v0 --> v5
    v1 --> v7
    v5 --> v7
    v7 --> v8
    v2 --> v10
    v8 --> v10
    v10 --> v11
  end
```
```mermaid
flowchart TB
  subgraph TARGET_PROCESS
    subgraph take
      v0["target_ch"]
      v2["blacklist_bed"]
      v1["wiggle_ch"]
    end
    v4([COMPUTEMATRIX])
    v5([HEATMAP])
    v6([PEAK_STATS])
    v0 --> v4
    v1 --> v4
    v2 --> v4
    v4 --> v5
    v4 --> v6
  end
```
We have tested the analysis with both WGS and lpWGS samples.
Input BAM files must be sorted and indexed.
File name or names, in BED or GTF format, containing the regions to compute and plot (see also the deepTools computeMatrix documentation).
This pipeline uses the ENCODE blacklist, or another blacklist BED file, to remove problematic regions from the analysis. Param: `blacklist_bed`.
- `preprocess`: if true, the pipeline filters the BAM by size, applies GC correction, and converts it to a bigWig file
- `bin_size`: the bin size used to generate the coverage file (bigWig)
- `target_expand_sx` and `target_expand_dx`: how many bp to expand the target region; the default is 4000 bp on both sides
- `filter_min` and `filter_max`: limits for read filtering; the default is 90-150 bp
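With the default `filter_min` and `filter_max`, the size filter keeps a fragment when its absolute template length falls in the 90-150 bp range. A sketch with a hypothetical helper (the pipeline does this on the BAM itself):

```python
def keep_fragment(tlen, filter_min=90, filter_max=150):
    """Size filter applied during preprocessing.

    tlen is the BAM template length; it is negative for the
    downstream mate of a pair, so the absolute value is used.
    """
    size = abs(tlen)
    return filter_min <= size <= filter_max
```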
Utility scripts are in the `scripts` directory.
location: `scripts/samplesheet_generator.py`

```
usage: samplesheet_generator.py [-h] [-r REGEXP] [-v] [-vv] FILE [FILE ...]

Generate samplesheet.csv for nf-fragmentomics pipeline

positional arguments:
  FILE                  BAM or wiggle files

options:
  -h, --help            show this help message and exit
  -r REGEXP, --regexp REGEXP
                        Parser regexp - default: ((.*)_(.*))\..*
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose   set loglevel to DEBUG

Author: Davide Rambaldi
```

Usage example with test files:

```bash
./scripts/samplesheet_generator.py tests/input/samplesheet_generator/*/*.{bw,bam}
```
location: `scripts/targets_generator.py`

```
usage: targets_generator.py [-h] [-v] [-vv] FILE [FILE ...]

Generate targets.csv for nf-fragmentomics pipeline

positional arguments:
  FILE                  BED files

options:
  -h, --help            show this help message and exit
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose   set loglevel to DEBUG

Author: Davide Rambaldi
```

Usage example with test files:

```bash
./scripts/targets_generator.py tests/input/targets_generator/*.bed
```
Some analysis examples with pseudocode
To verify whether a signal from open chromatin regions is visible in our dataset, we can run the following experiment:

- We take all the TSS from GeneHancer.
- We calculate the composite coverage (the peak) over the TSS of housekeeping genes (which should always be active).
- We then calculate the composite coverage for 100 sets of randomly picked genes that are NOT housekeeping.
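The random sets in the last step can be drawn with a few lines of code (a sketch with hypothetical names; the gene universe and set size depend on your annotation):

```python
import random

def random_gene_sets(all_genes, housekeeping, n_sets=100, set_size=500, seed=1):
    """Draw `n_sets` random gene sets, excluding housekeeping genes."""
    pool = sorted(set(all_genes) - set(housekeeping))
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [rng.sample(pool, set_size) for _ in range(n_sets)]
```

Each set is then converted to a BED of TSS regions and added as a target.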
We first need to load the `peak_data.tsv` files. These files contain 4 columns:

- `raw` is the raw signal
- `bin` is the x position
- `relative` is the signal relative to the `background_median`
- `background_median` is the median calculated as shown above
In R, for example, we can do this:

```r
library(readr)

mdata <- read_delim("path/to/peak_data.tsv", show_col_types = FALSE)
```
We can load the random dataset into one object (`random` in the next example) and the housekeeping dataset into another (`housekeeping`). We can now combine the two datasets in a plot with ggplot:
```r
library(ggplot2)

ggplot() +
  geom_line(data = random, aes(x = bin, y = relative), color = "grey", show.legend = FALSE) +
  geom_line(data = housekeeping, aes(x = bin, y = relative), color = "red", show.legend = FALSE) +
  scale_x_continuous(
    "Position relative to TSS (bp)",
    breaks = c(0, 200, 400, 600, 800),
    labels = c("-4kb", "-2kb", "0", "2kb", "4kb")
  ) +
  ylab("Relative coverage") +
  ggtitle("Housekeeping vs random gene sets") +  # choose your own title
  theme(legend.position = "bottom")
```
The housekeeping signal (red line in the plot) is quite different from the random gene sets! Do you agree?
Here we want to plot the composite coverage of a single transcription factor for a cohort across different timepoints. We start, as above, from the `peak_data.tsv` files, which contain the same 4 columns described earlier (`raw`, `bin`, `relative`, `background_median`).
In this case, it is better to load each file separately and merge the data into a single object. For each peak_data file:

```r
mdata <- read_delim("path/to/peak_data.tsv", show_col_types = FALSE)
```

Mutate `mdata` to add sample, case, and timepoint:

```r
mdata <- mdata |> mutate(
  sample = sample_name,
  case = case_name,
  timepoint = timepoint_name
)
```
You can now combine the different data objects into a single one using `bind_rows`. On the final dataset, we put `bin` on the X axis and the `relative` signal on the Y axis; we group and color observations by timepoint and split the plots by patient with `facet_wrap`.
Result:
Given a set of samples and a list of targets, we can build a heatmap of peak lengths or other peak statistics.
In this example, we use the R package ComplexHeatmap to draw and arrange multiple heatmaps.
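The heatmap drawing itself happens in R; assembling the sample-by-target matrix of one statistic from the `peak_stats.tsv` rows can be sketched language-agnostically (here in Python, with hypothetical record dicts and function name):

```python
def stats_matrix(records, stat="length"):
    """Arrange one peak statistic into a signal x target matrix.

    records: dicts with at least 'signal', 'target' and `stat` keys,
    e.g. rows read from the per-target peak_stats.tsv files.
    Missing (signal, target) combinations become None.
    """
    signals = sorted({r["signal"] for r in records})
    targets = sorted({r["target"] for r in records})
    value = {(r["signal"], r["target"]): r[stat] for r in records}
    matrix = [[value.get((s, t)) for t in targets] for s in signals]
    return signals, targets, matrix
```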
To test the R peakStats script:

```bash
cd peakStats
../../bin/fragmentomics_peakStats.R -s "Signal" -t "Target" -S "Source" --background-left-limit 50 --background-right-limit 50 ../input/matrix/FOXA2_regions_matrix.gz
```