Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XY filtration workflow #191

Open
wants to merge 67 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
67 commits
Select commit Hold shift + click to select a range
73e1e03
add patient_sex to template
Faizal-Eeman Nov 26, 2024
7cd06a6
add XY filter script from project-method-AlgorithmEvaluation-BNCH-000…
Faizal-Eeman Nov 26, 2024
cc5ba70
add par_bed parameter to template.config
Faizal-Eeman Dec 7, 2024
13fb844
add genome build to template
Faizal-Eeman Dec 7, 2024
43fd101
add user input sample id
Faizal-Eeman Dec 19, 2024
cfb4198
add sample sex; remove redundant code; exremove het calls for XY
Faizal-Eeman Dec 19, 2024
6d0ed9e
fix variables
Faizal-Eeman Dec 19, 2024
7abfcf7
extract autosomes
Faizal-Eeman Dec 19, 2024
9417465
merge autosomes and XY filtered calls
Faizal-Eeman Dec 19, 2024
f8417d5
change vcf header extraction location
Faizal-Eeman Dec 19, 2024
055ab3e
Add workflow steps to script note
Faizal-Eeman Dec 19, 2024
6c31337
clean up script
Faizal-Eeman Dec 19, 2024
08213c9
add skeleton code for vcf header temp file
Faizal-Eeman Dec 20, 2024
4be2064
write VCF source to temp file and parameterize it
Faizal-Eeman Dec 21, 2024
ac2d36c
add arg for variant caller
Faizal-Eeman Dec 21, 2024
4ff9575
set variant caller source
Faizal-Eeman Dec 21, 2024
7bdd8cc
improve documentation
Faizal-Eeman Dec 21, 2024
1eadfbb
add hail v0.2.133 and docker image to default config
Faizal-Eeman Dec 21, 2024
24cb06d
add filter-xy NF script
Faizal-Eeman Dec 21, 2024
27b39da
add NF skeleton
Faizal-Eeman Dec 21, 2024
73c7f1e
add script dir var in main
Faizal-Eeman Dec 21, 2024
570677b
add xy filtration command
Faizal-Eeman Dec 21, 2024
d247577
rename XY script
Faizal-Eeman Dec 21, 2024
71c294b
change arg variant_caller to vcf_source_file
Faizal-Eeman Dec 28, 2024
6cb1d9c
add VCF source extraction code to script section in nextflow module
Faizal-Eeman Dec 28, 2024
8b8948d
revert script output to bgz
Faizal-Eeman Dec 28, 2024
6e22838
set publishDir
Faizal-Eeman Dec 28, 2024
f41aab6
add channel for xy filter and call process
Faizal-Eeman Dec 28, 2024
28698a2
include filter XY module in main
Faizal-Eeman Dec 28, 2024
e572c84
simplify xy filter channel
Faizal-Eeman Dec 30, 2024
f43e8f9
add script dir ch
Faizal-Eeman Dec 30, 2024
08e7387
add script dir input to NF module
Faizal-Eeman Dec 30, 2024
c0af897
fix docker tag
Faizal-Eeman Dec 30, 2024
4a0687e
update script command to take vcf file source
Faizal-Eeman Jan 7, 2025
0e552e6
update sample sex parameter in template
Faizal-Eeman Jan 7, 2025
6d78944
update template config
Faizal-Eeman Jan 7, 2025
a99e3bb
add parameters to schema
Faizal-Eeman Jan 7, 2025
a4150b5
add genome build arg to script command
Faizal-Eeman Jan 7, 2025
9bf2c4d
parameterize genome build in script
Faizal-Eeman Jan 7, 2025
5408c83
temporarily add hail dev tag
Faizal-Eeman Jan 7, 2025
b3123f0
add params.par_bed as input
Faizal-Eeman Jan 7, 2025
d026262
add par_bed as process input
Faizal-Eeman Jan 7, 2025
3e0047f
fix output vcf dataset at export in script
Faizal-Eeman Jan 7, 2025
365c331
fix pylint
Faizal-Eeman Jan 8, 2025
f70361b
update docker hail version
Faizal-Eeman Jan 8, 2025
7928c5e
remove cat command
Faizal-Eeman Jan 8, 2025
7ab2a6e
fix log output dir
Faizal-Eeman Jan 8, 2025
29de6e0
standardize process name with tool name at the end
Faizal-Eeman Jan 8, 2025
f779ca5
standardize process name with tool name at the end
Faizal-Eeman Jan 8, 2025
c67cc5c
update sample_sex comment in template
Faizal-Eeman Jan 8, 2025
96768ff
add default and choices for genome_build
Faizal-Eeman Jan 8, 2025
b2cf9e8
add choices for sample_sex in schema
Faizal-Eeman Jan 8, 2025
1e51960
add resource allocation to filter_XY_Hail
Faizal-Eeman Jan 8, 2025
c51d3c3
emit xy filtered output
Faizal-Eeman Jan 8, 2025
a69f46c
generate checksum for xy filtered vqsr VCF
Faizal-Eeman Jan 8, 2025
bd9e781
add system command to VCF header
Faizal-Eeman Jan 9, 2025
002ffc3
Add XY filtration step to README
Faizal-Eeman Jan 9, 2025
f783530
fix process name
Faizal-Eeman Jan 9, 2025
f96c632
add XY filteration params to README
Faizal-Eeman Jan 9, 2025
9b6e036
Update script description
Faizal-Eeman Jan 9, 2025
6e571bc
add xy_filtration_workflow.md
Faizal-Eeman Jan 9, 2025
d9a1eae
add GRCh38 PAR to README
Faizal-Eeman Jan 9, 2025
bd568b6
fix pylint
Faizal-Eeman Jan 9, 2025
8f8326d
fix publishDir rules
Faizal-Eeman Jan 9, 2025
9b06490
Update outputs in README
Faizal-Eeman Jan 9, 2025
ca0e4f4
Upddate CHANGELOG
Faizal-Eeman Jan 9, 2025
e8d1d5d
update output filename
Faizal-Eeman Jan 10, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

## [Unreleased]
### Added
- Add XY filtration
- NFTest test case

---
Expand Down Expand Up @@ -152,7 +153,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
- Update reheadering to use -c option
- Modularize workflows for different modes (single vs. paired, WGS vs targeted)
- Update GATK to 4.2.4.0 to address Log4j critical vulnerability (https://github.com/advisories/GHSA-jfh8-c2jp-5v3q)
- Update Picard to 2.26.8 to address Log4j critical vulnerability (https://github.com/advisories/GHSA-jfh8-c2jp-5v3q)
- Update Picard to 2.26.8 to address Log4j critical vulnerability (https://github.com/advisories/GHSA-jfh8-c2jp-5v3q)

---

Expand Down
12 changes: 11 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,10 @@ Take the output from Step 6 as input, and apply the model in Step 5 to recalibra
### 8. Filter gSNP – Filter out ambiguous variants
Use customized Perl script to filter out ambiguous variants.

### 9. Generate sha512 checksum
### 9. Adjust chrX and chrY genotypes based on sample sex from recalibrated VCF
Apply XY filtration workflow to recalibrated VCF as discribed [here](docs/xy_filtration_workflow.md).
Comment on lines +81 to +82
Copy link

@tyamaguchi-ucla tyamaguchi-ucla Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

How does this process work with mouse samples (or other species)? Is this process optional?


### 10. Generate sha512 checksum
Generate sha512 checksum for VCFs and GVCFs.

---
Expand Down Expand Up @@ -115,6 +118,8 @@ For normal-only or tumor-only samples, exclude the fields for the other state.
|:----------------|:---------|:-----|:------------|
| `dataset_id` | Yes | string | Dataset ID |
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now |
| `genome_build` | Yes | string | Genome build, GRCh37 or GRCh38 |
| `sample_sex` | Yes | string | Sample Sex, XY or XX |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking):

@Faizal-Eeman @yashpatel6 We might've touched on this before but have we tried to adjust ploidy for male X Y chromosomes when running HC even before this filtering?

--sample-ploidy 2 \

| `output_dir` | Yes | string | Need to set if `blcds_registered_dataset = false` |
| `save_intermediate_files` | Yes | boolean | Set to false to disable publishing of intermediate files; true otherwise; disabling option will delete intermediate files to allow for processing of large BAMs |
| `cache_intermediate_pipeline_steps` | No | boolean | Set to true to enable process caching from Nextflow; defaults to false |
Expand All @@ -126,6 +131,7 @@ For normal-only or tumor-only samples, exclude the fields for the other state.
| `bundle_hapmap_3p3_vcf_gz` | Yes | path | Absolute path to HapMap 3.3 file, e.g., `/hot/resource/tool-specific-input/GATK/GRCh38/hapmap_3.3.hg38.vcf.gz` |
| `bundle_omni_1000g_2p5_vcf_gz` | Yes | path | Absolute path to 1000 genomes OMNI 2.5 file, e.g., `/hot/resource/tool-specific-input/GATK/GRCh38/1000G_omni2.5.hg38.vcf.gz` |
| `bundle_phase1_1000g_snps_high_conf_vcf_gz` | Yes | path | Absolute path to 1000 genomes phase 1 high-confidence file, e.g., `/hot/resource/tool-specific-input/GATK/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz` |
| `par_bed` | Yes | path | Absolute path to Pseudo-autosomal Region (PAR) BED |
| `work_dir` | optional | path | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively. |
| `docker_container_registry` | optional | string | Registry containing tool Docker images. Default: `ghcr.io/uclahs-cds` |
| `base_resource_update` | optional | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in `template.config` and below. |
Expand Down Expand Up @@ -199,6 +205,10 @@ base_resource_update {
| `<GATK>_<dataset_id>_<patient_id>_indel.vcf.gz` | Filtered INDELs with non-germline and ambiguous variants removed |
| `<GATK>_<dataset_id>_<patient_id>_indel.vcf.gz.tbi` | Filtered germline INDELs index |
| `<GATK>_<dataset_id>_<patient_id>_indel.vcf.gz.sha512` | Filtered germline INDELs sha512 checksum |
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz` | chrX/Y filtered SNP and INDEL recalibrated variants |
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.sha512` | chrX/Y filtered SNP and INDEL recalibrated variants checksum |
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.tbi` | chrX/Y filtered SNP and INDEL recalibrated variants index |
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.tbi.sha512` | chrX/Y filtered SNP and INDEL recalibrated variants index checksum |
Comment on lines +208 to +211

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

I recommend removing _filtered from the name. Many of our VCF outputs undergo some form of filtering, so the term might not add significant value here. Have we decided whether these outputs will be the final outputs or supplementary final outputs?

| `report.html`, `timeline.html` and `trace.txt` | Nextflow report, timeline and trace files |
| `*.command.*` | Process specific logging files created by nextflow |

Expand Down
10 changes: 10 additions & 0 deletions config/F16.config
Original file line number Diff line number Diff line change
Expand Up @@ -111,4 +111,14 @@ process {
}
}
}
withName: filter_XY_Hail {
cpus = 1
memory = 2.GB
retry_strategy {
memory {
strategy = 'exponential'
operand = 2
}
}
}
}
10 changes: 10 additions & 0 deletions config/F32.config
Original file line number Diff line number Diff line change
Expand Up @@ -111,4 +111,14 @@ process {
}
}
}
withName: filter_XY_Hail {
cpus = 2
memory = 4.GB
retry_strategy {
memory {
strategy = 'exponential'
operand = 2
}
}
}
}
10 changes: 10 additions & 0 deletions config/F72.config
Original file line number Diff line number Diff line change
Expand Up @@ -111,4 +111,14 @@ process {
}
}
}
withName: filter_XY_Hail {
cpus = 2
memory = 6.GB
retry_strategy {
memory {
strategy = 'exponential'
operand = 2
}
}
}
}
10 changes: 10 additions & 0 deletions config/M64.config
Original file line number Diff line number Diff line change
Expand Up @@ -111,4 +111,14 @@ process {
}
}
}
withName: filter_XY_Hail {
cpus = 4
memory = 10.GB
retry_strategy {
memory {
strategy = 'exponential'
operand = 2
}
}
}
}
4 changes: 3 additions & 1 deletion config/default.config
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,12 @@ params {
picard_version = "2.26.10"
pipeval_version = "4.0.0-rc.2"
gatkfilter_version = "v1.0.0"
hail_version = "0.2.133"
docker_image_gatk = "broadinstitute/gatk:${params.gatk_version}"
docker_image_picard = "${-> params.docker_container_registry}/picard:${params.picard_version}"
docker_image_pipeval = "${-> params.docker_container_registry}/pipeval:${params.pipeval_version}"
docker_image_gatkfilter = "${-> params.docker_container_registry}/gatk:${params.gatkfilter_version}"
docker_image_hail = "${-> params.docker_container_registry}/hail:${params.hail_version}"

emit_all_confident_sites = false
}
Expand All @@ -36,7 +38,7 @@ process {
cache = true

executor = 'local'

// Other directives or options that should apply for every process

// total amount of resources avaible to the pipeline
Expand Down
21 changes: 21 additions & 0 deletions config/schema.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,26 @@ patient_id:
type: 'String'
required: true
help: 'Patient ID'
sample_sex:
type: 'String'
required: true
help: 'Sample Sex'
Faizal-Eeman marked this conversation as resolved.
Show resolved Hide resolved
choices:
- "XY"
- "XX"
dataset_id:
type: 'String'
required: true
help: 'Dataset ID'
genome_build:
type: 'String'
required: true
help: 'Genome build, GRCh37 or GRCh38'
default:
- "GRCh38"
choice:
- "GRCh37"
- "GRCh38"
output_dir:
type: 'Path'
mode: 'w'
Expand Down Expand Up @@ -62,6 +78,11 @@ bundle_phase1_1000g_snps_high_conf_vcf_gz:
mode: 'r'
required: true
help: 'Absolute path to high-confidence 1000g SNPs VCF'
par_bed:
type: 'Path'
mode: 'r'
required: true
help: 'Absolute path to Pseudo-autosomal Region (PAR) BED'
base_resource_update:
type: 'ResourceUpdateNamespace'
required: false
Expand Down
8 changes: 8 additions & 0 deletions config/template.config
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,11 @@ params {
dataset_id = ''
blcds_registered_dataset = false // if you want the output to be registered

genome_build = "GRCh38"

// Input patient sex
sample_sex = '' // 'XY' or 'XX'

output_dir = '/path/to/output/directory'

// Set to false to disable the publish rule and delete intermediate files as they're no longer needed
Expand Down Expand Up @@ -43,6 +48,9 @@ params {
bundle_omni_1000g_2p5_vcf_gz = "/hot/resource/tool-specific-input/GATK/GRCh38/1000G_omni2.5.hg38.vcf.gz"
bundle_phase1_1000g_snps_high_conf_vcf_gz = "/hot/resource/tool-specific-input/GATK/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz"

// Specify BED file path for Pseudoautosomal Region (PAR)
par_bed = ""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be a standardized reference in /hot/resource/ ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer this to @yashpatel6 as I do not have permission to create a dir in /hot/resource/

Here's the GRCh38 version of PAR BED. You can remove the commented lines from this file when you make a copy in /hot/resource/ - /hot/project/method/AlgorithmEvaluation/BNCH-000122-GIABSexChrGermlineFilter/GIAB/AshkenazimTrio/germline-small-variant/filter_XY/pseudoautosomal_regions_hg38.bed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be moved to the /hot/resource, can you make an issue to request the reference dataset moving at https://github.com/uclahs-cds/group-dataset-standardization/issues/new/choose?


// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}
Expand Down
26 changes: 26 additions & 0 deletions docs/xy_filtration_workflow.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Filter XY calls from a germline VCF file

## Steps:
1. Extract autosomes and chrX/Y variants from input VCF
2. Filter chrX/Y variants
3. Merge autosomal and filtered chrX/Y variants

## chrX/Y Filter Criteria:
- Extract chrX/Y calls
- Extract chrX/Y calls overlapping with Pseudo-Autosomal Regions (PARs)
- For non-PAR chrX/Y calls
- if `sample_sex` is `XY`:
- Filter out heterozygous `GT` calls in chrX and chrY
- Transform homozygous `GT=1/1` to hemizygous `GT=1`
- if `sample_sex` is `XX`:
- Filter out `chrY` calls

## Pseudo-Autosomal Regions (PARs)
### GRCh38
| CHROM | START | END | PAR | REGION | REFERENCE |
|---|---|---|---|---|---|
| chrX | 10001 | 2781479 | PAR1 | Xp22 | EMSEMBL |
| chrX | 91434839 | 91438584 | PAR3/XTR | Xq21.3 | PMID:23708688 |
| chrX | 155701383 | 156030895 | PAR2 | Xq28 | ENSEMBL |
| chrY | 10001 | 10300000 | PAR1+PAR3/XTR | Yp11 | ENSEMBL +PMID:23708688 |
| chrY | 56887903 | 57217415 | PAR2 | Yq12 | ENSEMBL |
22 changes: 22 additions & 0 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,7 @@ include {
} from './module/merge-vcf.nf'
include { recalibrate_variants } from './module/workflow-recalibrate-variants.nf'
include { filter_gSNP_GATK } from './module/filter-gsnp.nf'
include { filter_XY_Hail } from './module/filter-xy.nf'
include { calculate_sha512 } from './module/checksum.nf'

// Returns the index file for the given bam or vcf
Expand Down Expand Up @@ -104,6 +105,12 @@ workflow {
}
.set{ input_ch_collected_files }

script_dir_ch = Channel.fromPath(
"$projectDir/script",
checkIfExists: true
)
.collect()

/**
* Input validation
*/
Expand Down Expand Up @@ -248,13 +255,28 @@ workflow {
recalibrate_variants.out.output_ch_recalibrated_variants
)

filter_xy_ch = recalibrate_variants.out.output_ch_recalibrated_variants
.map { it -> [it[0], it[1], it[2]] }

script_dir_ch = Channel.fromPath(
"$projectDir/script",
checkIfExists: true
)
.collect()

filter_XY_Hail(
filter_xy_ch,
params.par_bed,
script_dir_ch
)
/**
* Calculate checksums for output files
*/
run_MergeVcfs_Picard_VCF.out.merged_vcf
.mix(run_MergeVcfs_Picard_GVCF.out.merged_vcf)
.mix(recalibrate_variants.out.output_ch_recalibrated_variants)
.map{ [it[1], it[2]] }
.mix(filter_XY_Hail.out.xy_filtered_vqsr)
.mix(filter_gSNP_GATK.out.germline_filtered)
.flatten()
.set{ input_ch_calculate_checksum }
Expand Down
63 changes: 63 additions & 0 deletions module/filter-xy.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
include { generate_standard_filename; sanitize_string } from '../external/pipeline-Nextflow-module/modules/common/generate_standardized_filename/main.nf'

/*
Nextflow module for filtering chrX and chrY variant calls based on sample sex

input:
sample_id: identifier for sample
sample_vcf: path to VCF to filter
sample_vcf_tbi: path to index of VCF to filter

params:
params.output_dir_base: string(path)
params.log_output_dir: string(path)
params.docker_image_hail: string
params.sample_sex: string
params.par_bed: string(path)
*/

process filter_XY_Hail {
container params.docker_image_hail

publishDir path: "${params.output_dir_base}/output",
mode: "copy",
pattern: '*.vcf.bgz*'

publishDir path: "${params.log_output_dir}/process-log",
pattern: ".command.*",
mode: "copy",
saveAs: {
"${task.process.replace(':', '/')}-${sample_id}/log${file(it).getName()}"
}

input:
tuple val(sample_id), path(recalibrated_vcf), path(recalibrated_vcf_tbi)
path(par_bed)
path(script_dir)

output:
path(".command.*")
tuple path("${output_filename}_XY_filtered.vcf.bgz"), path("${output_filename}_XY_filtered.vcf.bgz.tbi"), emit: xy_filtered_vqsr

script:
output_filename = generate_standard_filename(
"Hail-${params.hail_version}",
params.dataset_id,
sample_id,
[additional_tools:["GATK-${params.gatk_version}"]]
)
"""
set -euo pipefail

zgrep "##source=" ${recalibrated_vcf} > ./vcf_source.txt

python ${script_dir}/filter_xy_call.py \
--sample_name ${output_filename} \
--input_vcf ${recalibrated_vcf} \
--vcf_source_file ./vcf_source.txt \
--sample_sex ${params.sample_sex} \
Faizal-Eeman marked this conversation as resolved.
Show resolved Hide resolved
--par_bed ${par_bed} \
--genome_build ${params.genome_build} \
--output_dir .
"""
}
Loading
Loading