Add XY filtration workflow #191
base: main
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -78,7 +78,10 @@ Take the output from Step 6 as input, and apply the model in Step 5 to recalibra | |||
### 8. Filter gSNP – Filter out ambiguous variants | ||||
Use customized Perl script to filter out ambiguous variants. | ||||
|
||||
### 9. Generate sha512 checksum | ||||
### 9. Adjust chrX and chrY genotypes based on sample sex from recalibrated VCF | ||||
Apply the XY filtration workflow to the recalibrated VCF as described [here](docs/xy_filtration_workflow.md). | ||||
|
||||
### 10. Generate sha512 checksum | ||||
Generate sha512 checksum for VCFs and GVCFs. | ||||
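
The checksum step can be illustrated with a minimal Python sketch. It is an assumption for illustration only: the diff does not show how the pipeline computes checksums, and the helper names `sha512_of_file` and `write_checksum`, as well as the `sha512sum`-style output layout, are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large VCFs/GVCFs are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path):
    """Write `<file>.sha512` next to the file (hypothetical helper).

    The "<digest>  <filename>" layout mimics `sha512sum` output; the
    pipeline's real checksum file layout is not shown in this diff.
    """
    path = Path(path)
    checksum_path = path.with_name(path.name + ".sha512")
    checksum_path.write_text(f"{sha512_of_file(path)}  {path.name}\n")
    return checksum_path
```

Calling `write_checksum("sample.vcf.gz")` on an existing file would produce `sample.vcf.gz.sha512` alongside it, matching the `.sha512` suffix used in the output table.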
|
||||
--- | ||||
|
@@ -115,6 +118,8 @@ For normal-only or tumor-only samples, exclude the fields for the other state. | |||
|:----------------|:---------|:-----|:------------| | ||||
| `dataset_id` | Yes | string | Dataset ID | | ||||
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now | | ||||
| `genome_build` | Yes | string | Genome build, GRCh37 or GRCh38 | | ||||
| `sample_sex` | Yes | string | Sample Sex, XY or XX | | ||||
question (non-blocking): @Faizal-Eeman @yashpatel6 We might've touched on this before, but have we tried adjusting ploidy for male X and Y chromosomes when running HC, even before this filtering?
|
||||
| `output_dir` | Yes | string | Need to set if `blcds_registered_dataset = false` | | ||||
| `save_intermediate_files` | Yes | boolean | Set to false to disable publishing of intermediate files; true otherwise; disabling option will delete intermediate files to allow for processing of large BAMs | | ||||
| `cache_intermediate_pipeline_steps` | No | boolean | Set to true to enable process caching from Nextflow; defaults to false | | ||||
|
@@ -126,6 +131,7 @@ For normal-only or tumor-only samples, exclude the fields for the other state. | |||
| `bundle_hapmap_3p3_vcf_gz` | Yes | path | Absolute path to HapMap 3.3 file, e.g., `/hot/resource/tool-specific-input/GATK/GRCh38/hapmap_3.3.hg38.vcf.gz` | | ||||
| `bundle_omni_1000g_2p5_vcf_gz` | Yes | path | Absolute path to 1000 genomes OMNI 2.5 file, e.g., `/hot/resource/tool-specific-input/GATK/GRCh38/1000G_omni2.5.hg38.vcf.gz` | | ||||
| `bundle_phase1_1000g_snps_high_conf_vcf_gz` | Yes | path | Absolute path to 1000 genomes phase 1 high-confidence file, e.g., `/hot/resource/tool-specific-input/GATK/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz` | | ||||
| `par_bed` | Yes | path | Absolute path to Pseudo-autosomal Region (PAR) BED | | ||||
| `work_dir` | optional | path | Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively. | | ||||
| `docker_container_registry` | optional | string | Registry containing tool Docker images. Default: `ghcr.io/uclahs-cds` | | ||||
| `base_resource_update` | optional | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in `template.config` and below. | | ||||
|
@@ -199,6 +205,10 @@ base_resource_update { | |||
| `<GATK>_<dataset_id>_<patient_id>_indel.vcf.gz` | Filtered INDELs with non-germline and ambiguous variants removed | | ||||
| `<GATK>_<dataset_id>_<patient_id>_indel.vcf.gz.tbi` | Filtered germline INDELs index | | ||||
| `<GATK>_<dataset_id>_<patient_id>_indel.vcf.gz.sha512` | Filtered germline INDELs sha512 checksum | | ||||
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz` | chrX/Y filtered SNP and INDEL recalibrated variants | | ||||
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.sha512` | chrX/Y filtered SNP and INDEL recalibrated variants checksum | | ||||
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.tbi` | chrX/Y filtered SNP and INDEL recalibrated variants index | | ||||
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.tbi.sha512` | chrX/Y filtered SNP and INDEL recalibrated variants index checksum | | ||||
Comment on lines +208 to +211
suggestion: I recommend removing |
||||
| `report.html`, `timeline.html` and `trace.txt` | Nextflow report, timeline and trace files | | ||||
| `*.command.*` | Process specific logging files created by nextflow | | ||||
|
||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,11 @@ params { | |
dataset_id = '' | ||
blcds_registered_dataset = false // if you want the output to be registered | ||
|
||
genome_build = "GRCh38" | ||
|
||
// Input patient sex | ||
sample_sex = '' // 'XY' or 'XX' | ||
|
||
output_dir = '/path/to/output/directory' | ||
|
||
// Set to false to disable the publish rule and delete intermediate files as they're no longer needed | ||
|
@@ -43,6 +48,9 @@ params { | |
bundle_omni_1000g_2p5_vcf_gz = "/hot/resource/tool-specific-input/GATK/GRCh38/1000G_omni2.5.hg38.vcf.gz" | ||
bundle_phase1_1000g_snps_high_conf_vcf_gz = "/hot/resource/tool-specific-input/GATK/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz" | ||
|
||
// Specify BED file path for Pseudoautosomal Region (PAR) | ||
par_bed = "" | ||
Will this be a standardized reference in /hot/resource/ ?
I'll defer this to @yashpatel6 as I do not have permission to create a dir in Here's the GRCh38 version of PAR BED. You can remove the commented lines from this file when you make a copy in
It should be moved to /hot/resource. Can you open an issue requesting that the reference dataset be moved, at https://github.com/uclahs-cds/group-dataset-standardization/issues/new/choose? |
||
|
||
// Base resource allocation updater | ||
// See README for adding parameters to update the base resource allocations | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Filter XY calls from a germline VCF file | ||
|
||
## Steps: | ||
1. Extract autosomes and chrX/Y variants from input VCF | ||
2. Filter chrX/Y variants | ||
3. Merge autosomal and filtered chrX/Y variants | ||
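
The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's Hail implementation: records are modeled as dictionaries with a `chrom` key, every contig other than chrX/chrY is treated as "autosomal" for simplicity, and `run_xy_workflow` and `keep_call` are hypothetical names.

```python
SEX_CHROMOSOMES = {"chrX", "chrY"}


def run_xy_workflow(records, keep_call):
    """Split records, filter the chrX/Y subset, and merge in input order.

    `records` is a list of dicts with at least a "chrom" key (assumed shape).
    `keep_call` is a caller-supplied predicate applied to chrX/Y records only.
    """
    indexed = list(enumerate(records))

    # Step 1: extract non-chrX/Y ("autosomal") and chrX/Y variants.
    autosomal = [(i, r) for i, r in indexed if r["chrom"] not in SEX_CHROMOSOMES]
    sex_chrom = [(i, r) for i, r in indexed if r["chrom"] in SEX_CHROMOSOMES]

    # Step 2: filter chrX/Y variants.
    kept = [(i, r) for i, r in sex_chrom if keep_call(r)]

    # Step 3: merge both subsets, restoring the original record order.
    merged = sorted(autosomal + kept, key=lambda pair: pair[0])
    return [r for _, r in merged]
```

Sorting on the original index keeps the merged output in input order even when contigs such as chrM follow chrY.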
|
||
## chrX/Y Filter Criteria: | ||
- Extract chrX/Y calls | ||
- Extract chrX/Y calls overlapping with Pseudo-Autosomal Regions (PARs) | ||
- For non-PAR chrX/Y calls: | ||
    - If `sample_sex` is `XY`: | ||
        - Filter out heterozygous `GT` calls in chrX and chrY | ||
        - Transform homozygous `GT=1/1` to hemizygous `GT=1` | ||
    - If `sample_sex` is `XX`: | ||
        - Filter out `chrY` calls | ||
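
The non-PAR criteria above can be expressed as a small decision function. This is a sketch in plain Python rather than the Hail code in `filter_xy_call.py`, which this diff does not show. The function name `filter_non_par_call` is hypothetical, and only `1/1` is converted to a hemizygous call because that is the only transformation the criteria state.

```python
def filter_non_par_call(chrom, genotype, sample_sex):
    """Apply the non-PAR chrX/Y criteria to one genotype string.

    Returns the (possibly adjusted) GT string, or None if the call
    should be filtered out. PAR calls are assumed to be handled elsewhere.
    """
    if sample_sex not in ("XY", "XX"):
        raise ValueError(f"sample_sex must be 'XY' or 'XX', got {sample_sex!r}")
    if chrom not in ("chrX", "chrY"):
        return genotype  # criteria only cover chrX/Y

    if sample_sex == "XX":
        # XX samples: filter out chrY calls, keep chrX calls unchanged.
        return None if chrom == "chrY" else genotype

    # XY samples: treat phased ("|") and unphased ("/") genotypes alike.
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) == 1:
        return genotype  # already haploid
    if len(set(alleles)) > 1:
        return None  # heterozygous call in a hemizygous region
    if alleles[0] == "1":
        return "1"  # homozygous GT=1/1 becomes hemizygous GT=1
    return genotype  # other homozygous calls: not specified, left as-is
```

For example, `filter_non_par_call("chrX", "0/1", "XY")` drops the call, while the same genotype for an `XX` sample is kept unchanged.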
|
||
## Pseudo-Autosomal Regions (PARs) | ||
### GRCh38 | ||
| CHROM | START | END | PAR | REGION | REFERENCE | | ||
|---|---|---|---|---|---| | ||
| chrX | 10001 | 2781479 | PAR1 | Xp22 | ENSEMBL | ||
| chrX | 91434839 | 91438584 | PAR3/XTR | Xq21.3 | PMID:23708688 | | ||
| chrX | 155701383 | 156030895 | PAR2 | Xq28 | ENSEMBL | | ||
| chrY | 10001 | 10300000 | PAR1+PAR3/XTR | Yp11 | ENSEMBL + PMID:23708688 | ||
| chrY | 56887903 | 57217415 | PAR2 | Yq12 | ENSEMBL | |
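
A PAR lookup over the GRCh38 coordinates in the table might look like the sketch below. Two assumptions are made: the START and END values are treated as 1-based inclusive positions (a BED file would normally use 0-based, half-open intervals), and the names `GRCH38_PAR` and `par_label` are hypothetical rather than taken from the PR.

```python
# GRCh38 PAR intervals copied from the table above.
# ASSUMPTION: coordinates are interpreted as 1-based, inclusive.
GRCH38_PAR = {
    "chrX": [
        (10001, 2781479, "PAR1"),
        (91434839, 91438584, "PAR3/XTR"),
        (155701383, 156030895, "PAR2"),
    ],
    "chrY": [
        (10001, 10300000, "PAR1+PAR3/XTR"),
        (56887903, 57217415, "PAR2"),
    ],
}


def par_label(chrom, pos):
    """Return the PAR name containing `pos` on `chrom`, or None if non-PAR."""
    for start, end, label in GRCH38_PAR.get(chrom, []):
        if start <= pos <= end:
            return label
    return None
```

A call whose position yields `None` here is non-PAR and would be subject to the sex-specific filter criteria; calls inside a PAR are diploid in both sexes.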
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
include { generate_standard_filename; sanitize_string } from '../external/pipeline-Nextflow-module/modules/common/generate_standardized_filename/main.nf' | ||
|
||
/* | ||
Nextflow module for filtering chrX and chrY variant calls based on sample sex | ||
|
||
input: | ||
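sample_id: identifier for sample | ||
recalibrated_vcf: path to recalibrated VCF to filter | ||
recalibrated_vcf_tbi: path to index of recalibrated VCF to filter | ||
par_bed: path to Pseudo-Autosomal Region (PAR) BED file | ||
script_dir: path to directory containing filter_xy_call.py | ||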
|
||
params: | ||
params.output_dir_base: string(path) | ||
params.log_output_dir: string(path) | ||
params.docker_image_hail: string | ||
params.sample_sex: string | ||
params.par_bed: string(path) | ||
*/ | ||
|
||
process filter_XY_Hail { | ||
container params.docker_image_hail | ||
|
||
publishDir path: "${params.output_dir_base}/output", | ||
mode: "copy", | ||
pattern: '*.vcf.bgz*' | ||
|
||
publishDir path: "${params.log_output_dir}/process-log", | ||
pattern: ".command.*", | ||
mode: "copy", | ||
saveAs: { | ||
"${task.process.replace(':', '/')}-${sample_id}/log${file(it).getName()}" | ||
} | ||
|
||
input: | ||
tuple val(sample_id), path(recalibrated_vcf), path(recalibrated_vcf_tbi) | ||
path(par_bed) | ||
path(script_dir) | ||
|
||
output: | ||
path(".command.*") | ||
tuple path("${output_filename}_XY_filtered.vcf.bgz"), path("${output_filename}_XY_filtered.vcf.bgz.tbi"), emit: xy_filtered_vqsr | ||
|
||
script: | ||
output_filename = generate_standard_filename( | ||
"Hail-${params.hail_version}", | ||
params.dataset_id, | ||
sample_id, | ||
[additional_tools:["GATK-${params.gatk_version}"]] | ||
) | ||
""" | ||
set -euo pipefail | ||
|
||
zgrep "##source=" ${recalibrated_vcf} > ./vcf_source.txt | ||
|
||
python ${script_dir}/filter_xy_call.py \ | ||
--sample_name ${output_filename} \ | ||
--input_vcf ${recalibrated_vcf} \ | ||
--vcf_source_file ./vcf_source.txt \ | ||
--sample_sex ${params.sample_sex} \ | ||
|
||
--par_bed ${par_bed} \ | ||
--genome_build ${params.genome_build} \ | ||
--output_dir . | ||
""" | ||
} |
question:
How does this process work with mouse samples (or other species)? Is this process optional?