Recent research indicates that analyzing the fragmentation patterns of cell-free DNA (cfDNA) from genome sequencing data can provide insights into nucleosome occupancy in the cells of origin. In accessible genomic regions, nucleosomes are arranged systematically to facilitate access for DNA-binding proteins.
This pipeline calculates the composite coverage over specific genome regions.
BAM input files are filtered by fragment size (by default, cfDNA fragments between 90 and 150 bp are retained) and corrected for GC bias using the method proposed by [Benjamini & Speed (2012). Nucleic Acids Research, 40(10)].
The filtered BAM is then converted into a coverage file (bigWig). At this stage, a blacklist of genomic regions can be applied; we blacklist the problematic regions of the genome defined by ENCODE. [Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z]
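The pipeline performs these steps with deepTools. As an illustration only, the core idea of the Benjamini & Speed correction (reweighting fragments by expected vs. observed counts per GC class) can be sketched as follows; the function names are hypothetical and this is not the pipeline's implementation:

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G/C bases in a fragment sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_weights(observed_gc, expected_gc, bins=10):
    """Per-GC-bin correction weight = expected / observed fragment counts.

    observed_gc: GC fractions of the sequenced fragments.
    expected_gc: GC fractions of fragments drawn uniformly from the genome.
    """
    to_bin = lambda g: min(int(g * bins), bins - 1)
    obs = Counter(to_bin(g) for g in observed_gc)
    exp = Counter(to_bin(g) for g in expected_gc)
    # A fragment's weight is > 1 where its GC class is under-represented.
    return {b: exp[b] / obs[b] for b in obs if exp[b]}
```

Fragments would then be counted with their weight instead of 1 when building the coverage track.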
With computeMatrix, we calculate the coverage of the signal (bigWig file) over a set of targets (BED file); in our analysis, this is typically a set of transcription factor binding site (TFBS) regions. We use the reference-point mode: the signal distribution is computed relative to a point (the TFBS center in our case), expanding 4 kb on both sides.
A first plot of the composite coverage on the TFBS target set (raw coverage) is generated by the process plotHeatmap.
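Conceptually, the composite coverage is an average of the binned signal over all target windows. A minimal sketch of that idea, using plain Python lists in place of the pipeline's bigWig input (the helper is hypothetical, not part of the pipeline):

```python
def composite_coverage(coverage, centers, flank=4000, bin_size=10):
    """Mean binned coverage in [center - flank, center + flank) across sites.

    coverage: per-base signal for one chromosome (list of floats).
    centers:  TFBS midpoint positions on that chromosome.
    Sites whose window falls outside the signal are skipped.
    """
    n_bins = (2 * flank) // bin_size
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for center in centers:
        start = center - flank
        if start < 0 or start + 2 * flank > len(coverage):
            continue
        for b in range(n_bins):
            lo = start + b * bin_size
            window = coverage[lo:lo + bin_size]
            sums[b] += sum(window) / bin_size
            counts[b] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]
```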
The process PEAK_STATS generates 3 output files:

- `peak_data.tsv`: contains, for each bin, the composite coverage (`raw`), the bin position (`bin`), the coverage relative to the background median (`relative`), and the background median itself (`background_median`).
- `peak_stats.tsv`: summary statistics for the composite coverage: signal, target, source, integration (Monte Carlo integration), length of the peak (`length`), relative length of the peak (`rlength`), min and max raw values (`ymin` and `ymax`), and the ratio between length and integration (`ratio`).
- `PeakIntegration.pdf`: PDF plot of the calculated metrics.
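The statistics themselves are produced by the pipeline's R script. The `relative` normalization can be sketched as follows, assuming (as in the test invocation shown at the end of this document) a background of 50 bins on each side of the window; the function name is hypothetical:

```python
import statistics

def peak_relative(raw, background_bins=50):
    """Normalize a composite profile by its background median.

    The background is taken from the outermost `background_bins` bins on
    each side of the window (cf. --background-left-limit and
    --background-right-limit in fragmentomics_peakStats.R).
    """
    background = raw[:background_bins] + raw[len(raw) - background_bins:]
    bg_median = statistics.median(background)
    return {
        "background_median": bg_median,
        "relative": [r / bg_median for r in raw],
        "ymin": min(raw),
        "ymax": max(raw),
    }
```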
```mermaid
flowchart TB
  subgraph " "
    subgraph params
      v4["input"]
      v2["blacklist_bed"]
      v0["genome_2bit"]
      v6["targets"]
    end
    v9([FRAGMENTOMICS])
    v0 --> v9
    v2 --> v9
    v4 --> v9
    v6 --> v9
  end
```
Create a `samplesheet.csv`:

```csv
caseid,sampleid,timepoint,bam,bai,bw
PATIENT_A,PATIENT_A_T1,T1,/path/to/bam.bam,/path/to/bam.bai,
PATIENT_A,PATIENT_A_T2,T2,/path/to/bam.bam,/path/to/bam.bai,
```
Where:

- `caseid` is the patient
- `sampleid` is the sample
- `timepoint` is the time group
- `bam` is the input BAM file
- `bai` is the BAM index
- `bw` is the (optional) bigWig file (if provided, preprocessing will be skipped)
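For a quick sanity check of a samplesheet outside the pipeline, the columns can be validated with a short script (a sketch, not part of the pipeline; names are hypothetical):

```python
import csv
import io

REQUIRED = ("caseid", "sampleid", "timepoint", "bam", "bai", "bw")

def parse_samplesheet(text):
    """Parse samplesheet.csv; an empty `bw` means preprocessing will run."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        missing = [col for col in REQUIRED if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        row["skip_preprocess"] = bool(row["bw"])
    return rows
```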
Create a target list (`targets.csv`):

```csv
name,source,bed
MYC,GRIFFIN,./tests/input/stub/myc.bed
ELK4,TIMON,./tests/input/stub/elk4.bed
rand1,house_keeping_dataset,./tests/input/stub/rand1.bed
rand2,house_keeping_dataset,./tests/input/stub/rand2.bed
rand3,house_keeping_dataset,./tests/input/stub/rand3.bed
HouseKeeping,house_keeping_dataset,./tests/input/stub/GeneHancer_housekeeping.bed
```
Here we are defining 6 targets:

- MYC transcription factor binding sites (GRIFFIN source) and ELK4 binding sites (TIMON source).
- 3 sets of randomly picked genes (rand1-rand3), used for comparison against the housekeeping set.
- A set of housekeeping genes from GeneHancer, which should always be active.
Required parameters:

```yaml
input: "samplesheet.csv"
targets: "targets.csv"
outdir: "./results"
```

- `genome_2bit`: genome in two-bit format. Most genomes can be found here: http://hgdownload.cse.ucsc.edu/gbdb/
- `blacklist_bed`: BED file with blacklisted regions, used in COVERAGEBAM (bigWig generation) and in COMPUTEMATRIX (matrix calculation). We are using the ENCODE blacklist from: Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z
- `genome_size`: effective genome size used by the GC correction functions (see also the deepTools effective genome size page)
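Putting the required parameters together, a `params.yaml` might look like this (the paths are placeholders; the `genome_size` shown is the deepTools effective genome size for GRCh38):

```yaml
input: "samplesheet.csv"
targets: "targets.csv"
outdir: "./results"
genome_2bit: "/path/to/hg38.2bit"
blacklist_bed: "/path/to/encode-blacklist.bed"
genome_size: 2913022398
```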
Available profiles (see also conf/profiles.config
):
stub
: run with stub truedebug
: run with debug truedevel
: run locallyslurm
: for our HPC
Some profile examples:

Stub run with stub annotation files:

```bash
conda activate nf-fragmentomics-env
nextflow run main.nf -params-file examples/input/example_params.yaml -profile stub -stub-run
```

Run on a single machine with the conda environment:

```bash
nextflow run main.nf -profile devel,conda -params-file params.yaml
```

Run on a SLURM cluster with Singularity:

```bash
nextflow run main.nf -profile slurm,singularity -params-file params.yaml
```
```mermaid
flowchart TB
  subgraph BAM_PREPROCESS
    subgraph take
      v0["bam_ch"]
      v2["blacklist_bed"]
      v1["genome_2bit"]
    end
    v3([BAMPEFRAGMENTSIZE])
    v4([PLOTCOVERAGE])
    v5([FILTERBAMBYSIZE])
    v7([COMPUTEGCBIAS])
    v8([CORRECTGCBIAS])
    v10([COVERAGEBAM])
    subgraph emit
      v11["wiggle_ch"]
    end
    v0 --> v3
    v0 --> v4
    v0 --> v5
    v1 --> v7
    v5 --> v7
    v7 --> v8
    v2 --> v10
    v8 --> v10
    v10 --> v11
  end
```
```mermaid
flowchart TB
  subgraph TARGET_PROCESS
    subgraph take
      v0["target_ch"]
      v2["blacklist_bed"]
      v1["wiggle_ch"]
    end
    v4([COMPUTEMATRIX])
    v5([HEATMAP])
    v6([PEAK_STATS])
    v0 --> v4
    v1 --> v4
    v2 --> v4
    v4 --> v5
    v4 --> v6
  end
```
We have tested the analysis with both WGS and lpWGS samples.
Input BAM files must be sorted and indexed.
File name or names, in BED or GTF format, containing the regions to compute and plot (see also the deepTools computeMatrix documentation).
This pipeline uses the ENCODE blacklist, or another blacklist BED file, to remove problematic regions from the analysis. Param: `blacklist_bed`.
- `preprocess`: if true, the pipeline filters the BAM by size, applies GC correction, and converts it to a bigWig file
- `bin_size`: the bin size used to generate the coverage file (bigWig)
- `target_expand_sx` and `target_expand_dx`: how many bp to expand the target region; the default is 4000 bp on both sides
- `filter_min` and `filter_max`: limits for read filtering; the default is 90-150 bp
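With the default `filter_min` and `filter_max`, the size filter keeps a fragment when its absolute template length falls in the 90-150 bp range. A sketch with a hypothetical helper (the pipeline does this on the BAM itself):

```python
def keep_fragment(tlen, filter_min=90, filter_max=150):
    """Size filter applied during preprocessing.

    tlen is the BAM template length; it is negative for the
    downstream mate of a pair, so the absolute value is used.
    """
    size = abs(tlen)
    return filter_min <= size <= filter_max
```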
Utility scripts are in the `scripts` directory.
location: `scripts/samplesheet_generator.py`

```
usage: samplesheet_generator.py [-h] [-r REGEXP] [-v] [-vv] FILE [FILE ...]

Generate samplesheet.csv for nf-fragmentomics pipeline

positional arguments:
  FILE                  BAM or wiggle files

options:
  -h, --help            show this help message and exit
  -r REGEXP, --regexp REGEXP
                        Parser regexp - default: ((.*)_(.*))\..*
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose   set loglevel to DEBUG

Author: Davide Rambaldi
```

Usage example with test files:

```bash
./scripts/samplesheet_generator.py tests/input/samplesheet_generator/*/*.{bw,bam}
```
location: `scripts/targets_generator.py`

```
usage: targets_generator.py [-h] [-v] [-vv] FILE [FILE ...]

Generate targets.csv for nf-fragmentomics pipeline

positional arguments:
  FILE                  BED files

options:
  -h, --help            show this help message and exit
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose   set loglevel to DEBUG

Author: Davide Rambaldi
```

Usage example with test files:

```bash
./scripts/targets_generator.py tests/input/targets_generator/*.bed
```
Some analysis examples with pseudocode
To verify whether a signal from open chromatin regions is visible in our dataset, we can run the following experiment:

- We take all the TSS from GeneHancer.
- We calculate the composite coverage (the peak) over the TSS of housekeeping genes (which should always be active).
- We then calculate the composite coverage for 100 sets of randomly picked genes that are NOT housekeeping.
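The random sets in the last step can be drawn with a few lines of code (a sketch with hypothetical names; the gene universe and set size depend on your annotation):

```python
import random

def random_gene_sets(all_genes, housekeeping, n_sets=100, set_size=500, seed=1):
    """Draw `n_sets` random gene sets, excluding housekeeping genes."""
    pool = sorted(set(all_genes) - set(housekeeping))
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [rng.sample(pool, set_size) for _ in range(n_sets)]
```

Each set is then converted to a BED of TSS regions and added as a target.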
We first need to load the `peak_data.tsv` files. These files contain 4 columns:

- `raw` is the raw signal
- `bin` is the x position
- `relative` is the signal relative to the `background_median`
- `background_median` is the median calculated as shown above
In R, for example, we can do this:

```r
library(readr)

mdata <- read_delim("path/to/peak_data.tsv", show_col_types = FALSE)
```
We can load the random dataset into one object (`random` in the next example) and the housekeeping dataset into another (`housekeeping`). We can now combine the two datasets in a plot with ggplot:
```r
library(ggplot2)

ggplot() +
  geom_line(data = random, aes(x = bin, y = relative), color = "grey", show.legend = FALSE) +
  geom_line(data = housekeeping, aes(x = bin, y = relative), color = "red", show.legend = FALSE) +
  scale_x_continuous(
    "Position relative to TSS (bp)",
    breaks = c(0, 200, 400, 600, 800),
    labels = c("-4kb", "-2kb", "0", "2kb", "4kb")
  ) +
  ylab("Relative coverage") +
  ggtitle("Housekeeping vs random gene sets") +  # choose your own title
  theme(legend.position = "bottom")
```
The housekeeping signal (red line in the plot) is quite different from the random gene sets! Do you agree?
Here we want to plot the composite coverage of a single transcription factor for a cohort across different timepoints. We start, as above, from the `peak_data.tsv` files, which contain the same 4 columns described earlier (`raw`, `bin`, `relative`, `background_median`).
In this case, it is better to load each file separately and merge the data into a single object. For each peak_data file:

```r
mdata <- read_delim("path/to/peak_data.tsv", show_col_types = FALSE)
```

Mutate `mdata` to add sample, case, and timepoint:

```r
mdata <- mdata |> mutate(
  sample = sample_name,
  case = case_name,
  timepoint = timepoint_name
)
```
You can now combine the different data objects into a single one using `bind_rows`. On the final dataset, we put `bin` on the X axis and the `relative` signal on the Y axis; we group and color observations by timepoint and split the plots by patient with `facet_wrap`.
Result:
Given a set of samples and a list of targets, we can build a heatmap of peak lengths or other peak statistics.
In this example, we use the R package ComplexHeatmap to draw and arrange multiple heatmaps.
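The heatmap drawing itself happens in R; assembling the sample-by-target matrix of one statistic from the `peak_stats.tsv` rows can be sketched language-agnostically (here in Python, with hypothetical record dicts and function name):

```python
def stats_matrix(records, stat="length"):
    """Arrange one peak statistic into a signal x target matrix.

    records: dicts with at least 'signal', 'target' and `stat` keys,
    e.g. rows read from the per-target peak_stats.tsv files.
    Missing (signal, target) combinations become None.
    """
    signals = sorted({r["signal"] for r in records})
    targets = sorted({r["target"] for r in records})
    value = {(r["signal"], r["target"]): r[stat] for r in records}
    matrix = [[value.get((s, t)) for t in targets] for s in signals]
    return signals, targets, matrix
```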
To test the R peakStats script:

```bash
cd peakStats
../../bin/fragmentomics_peakStats.R -s "Signal" -t "Target" -S "Source" --background-left-limit 50 --background-right-limit 50 ../input/matrix/FOXA2_regions_matrix.gz
```