Merge pull request #8 from oicr-gsi/track-vcf-name

Track vcf name
oicr-gsi · Jan 16, 2023 · b12ef78 · b12ef78
2 parents 6cf34f6 + f05ce99
commit b12ef78
Show file tree

Hide file tree

Showing 9 changed files with 292 additions and 561 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,12 @@
 # CHANGELOG
 
-## 1.0 - 2022-11-21
+## 0.3 - unreleased
+- moved mrdetect algorithm scripts to bitbucket
+- seperated filterVCF function from detectSNVs
+- added SNP.count and VAF files
+- changed filterAndDetect command to match code updates (v1.1.1)
+
+## 0.2 - 2022-11-21
 - workflow finalized for production
 
 ## 0.0 - 2022-10-14

diff --git a/README.md b/README.md
@@ -39,118 +39,111 @@ Parameter|Value|Default|Description
 #### Optional task parameters:
 Parameter|Value|Default|Description
 ---|---|---|---
-`detectSample.modules`|String|"mrdetect/1.0 bcftools/1.9 hg38/p12 hg38-dac-exclusion/1.0 mrdetect-scripts/1.1 pwgs-blocklist/hg38_1"|Required environment modules
-`detectSample.jobMemory`|Int|64|Memory allocated for this job (GB)
-`detectSample.threads`|Int|4|Requested CPU threads
-`detectSample.timeout`|Int|10|Hours before task timeout
-`detectSample.tumorVCFfilter`|String|"FILTER~'haplotype' | FILTER~'clustered_events' | FILTER~'slippage' | FILTER~'weak_evidence' | FILTER~'strand_bias' | FILTER~'position' | FILTER~'normal_artifact' | FILTER~'multiallelic' | FILTER~'map_qual' | FILTER~'germline' | FILTER~'fragment' | FILTER~'contamination' | FILTER~'base_qual'"|set of filter calls to incl. in tumor VCF (any line with these flags will be included
-`detectSample.tumorVAF`|String|"0.1"|Variant Allele Frequency for tumor VCF
-`detectSample.pickle`|String|"$MRDETECT_ROOT/MRDetect-master/MRDetectSNV/trained_SVM.pkl"|trained pickle for detecting real tumor reads
-`detectSample.blocklist`|String|"$PWGS_BLOCKLIST_ROOT/blocklist.vcf.gz"|list of sites to exclude from analysis, gzipped
-`detectSample.genome`|String|"$HG38_ROOT/hg38_random.fa"|Path to loaded genome .fa
-`detectSample.difficultRegions`|String|"--regions-file $HG38_DAC_EXCLUSION_ROOT/hg38-dac-exclusion.v2.bed"|Path to .bed excluding difficult regions, string must include the flag --regions-file 
-`detectSample.filterAndDetectScript`|String|"$MRDETECT_ROOT/bin/filterAndDetect"|location of filter and detect script
+`filterVCF.tumorVCFfilter`|String|"FILTER~'haplotype' | FILTER~'clustered_events' | FILTER~'slippage' | FILTER~'weak_evidence' | FILTER~'strand_bias' | FILTER~'position' | FILTER~'normal_artifact' | FILTER~'multiallelic' | FILTER~'map_qual' | FILTER~'germline' | FILTER~'fragment' | FILTER~'contamination' | FILTER~'base_qual'"|set of filter calls to incl. in tumor VCF (any line with these flags will be included
+`filterVCF.tumorVAF`|String|"0.1"|Variant Allele Frequency for tumor VCF
+`filterVCF.genome`|String|"$HG38_ROOT/hg38_random.fa"|Path to loaded genome .fa
+`filterVCF.difficultRegions`|String|"--regions-file $HG38_DAC_EXCLUSION_ROOT/hg38-dac-exclusion.v2.bed"|Path to .bed excluding difficult regions, string must include the flag --regions-file 
+`filterVCF.modules`|String|"bcftools/1.9 hg38/p12 hg38-dac-exclusion/1.0"|Required environment modules
+`filterVCF.jobMemory`|Int|64|Memory allocated for this job (GB)
+`filterVCF.threads`|Int|4|Requested CPU threads
+`filterVCF.timeout`|Int|10|Hours before task timeout
 `parseControls.jobMemory`|Int|4|Memory for this task in GB
 `parseControls.timeout`|Int|12|Timeout in hours, needed to override imposed limits
-`detectControl.modules`|String|"mrdetect/1.0 bcftools/1.9 hg38/p12 hg38-dac-exclusion/1.0 mrdetect-scripts/1.1 pwgs-blocklist/hg38_1"|Required environment modules
+`detectControl.plasmaSampleName`|String|basename(plasmabam,".bam")|name for plasma sample (from bam)
+`detectControl.tumorSampleName`|String|basename(tumorvcf,".vcf")|name for tumour sample (from vcf)
+`detectControl.modules`|String|"mrdetect/1.1.1 pwgs-blocklist/hg38.1"|Required environment modules
 `detectControl.jobMemory`|Int|64|Memory allocated for this job (GB)
 `detectControl.threads`|Int|4|Requested CPU threads
-`detectControl.timeout`|Int|10|Hours before task timeout
-`detectControl.tumorVCFfilter`|String|"FILTER~'haplotype' | FILTER~'clustered_events' | FILTER~'slippage' | FILTER~'weak_evidence' | FILTER~'strand_bias' | FILTER~'position' | FILTER~'normal_artifact' | FILTER~'multiallelic' | FILTER~'map_qual' | FILTER~'germline' | FILTER~'fragment' | FILTER~'contamination' | FILTER~'base_qual'"|set of filter calls to incl. in tumor VCF (any line with these flags will be included
-`detectControl.tumorVAF`|String|"0.1"|Variant Allele Frequency for tumor VCF
-`detectControl.pickle`|String|"$MRDETECT_ROOT/MRDetect-master/MRDetectSNV/trained_SVM.pkl"|trained pickle for detecting real tumor reads
+`detectControl.timeout`|Int|20|Hours before task timeout
+`detectControl.pickle`|String|"$MRDETECT_ROOT/bin/MRDetectSNV/trained_SVM.pkl"|trained pickle for detecting real tumor reads
 `detectControl.blocklist`|String|"$PWGS_BLOCKLIST_ROOT/blocklist.vcf.gz"|list of sites to exclude from analysis, gzipped
-`detectControl.genome`|String|"$HG38_ROOT/hg38_random.fa"|Path to loaded genome .fa
-`detectControl.difficultRegions`|String|"--regions-file $HG38_DAC_EXCLUSION_ROOT/hg38-dac-exclusion.v2.bed"|Path to .bed excluding difficult regions, string must include the flag --regions-file 
-`detectControl.filterAndDetectScript`|String|"$MRDETECT_ROOT/bin/filterAndDetect"|location of filter and detect script
-`snvDetectionSummary.DetectionRScript`|String|"$MRDETECT_SCRIPTS_ROOT/bin/pwg_test.R"|location of pwg_test.R
+`detectControl.pullreadsScript`|String|"$MRDETECT_ROOT/bin/pull_reads"|pull_reads.py executable
+`detectControl.qualityscoreScript`|String|"$MRDETECT_ROOT/bin/quality_score"|quality_score.py executable
+`detectControl.filterAndDetectScript`|String|"$MRDETECT_ROOT/bin/filterAndDetect"|filterAndDetect.py executable
+`detectSample.plasmaSampleName`|String|basename(plasmabam,".bam")|name for plasma sample (from bam)
+`detectSample.tumorSampleName`|String|basename(tumorvcf,".vcf")|name for tumour sample (from vcf)
+`detectSample.modules`|String|"mrdetect/1.1.1 pwgs-blocklist/hg38.1"|Required environment modules
+`detectSample.jobMemory`|Int|64|Memory allocated for this job (GB)
+`detectSample.threads`|Int|4|Requested CPU threads
+`detectSample.timeout`|Int|20|Hours before task timeout
+`detectSample.pickle`|String|"$MRDETECT_ROOT/bin/MRDetectSNV/trained_SVM.pkl"|trained pickle for detecting real tumor reads
+`detectSample.blocklist`|String|"$PWGS_BLOCKLIST_ROOT/blocklist.vcf.gz"|list of sites to exclude from analysis, gzipped
+`detectSample.pullreadsScript`|String|"$MRDETECT_ROOT/bin/pull_reads"|pull_reads.py executable
+`detectSample.qualityscoreScript`|String|"$MRDETECT_ROOT/bin/quality_score"|quality_score.py executable
+`detectSample.filterAndDetectScript`|String|"$MRDETECT_ROOT/bin/filterAndDetect"|filterAndDetect.py executable
+`snvDetectionSummary.pvalue`|String|0.01|p-value for HBC error rate
 `snvDetectionSummary.jobMemory`|Int|20|Memory allocated for this job (GB)
 `snvDetectionSummary.threads`|Int|1|Requested CPU threads
 `snvDetectionSummary.timeout`|Int|2|Hours before task timeout
-`snvDetectionSummary.modules`|String|"mrdetect-scripts/1.1"|Required environment modules
+`snvDetectionSummary.modules`|String|"mrdetect/1.1.1"|Required environment modules
+`snvDetectionSummary.pwgtestscript`|String|"$MRDETECT_ROOT/bin/pwg_test"|executable of pwg_test.R
 
 
 ### Outputs
 
 Output | Type | Description
 ---|---|---
-`snvDetectionFinalResult`|File?|Result from SNV detection for sample
-`snvDetectionHBCResult`|File|Result from SNV detection for sample HBCs
+`snvDetectionResult`|File|Result from SNV detection incl sample HBCs
 `pWGS_svg`|File|pWGS svg
-`stats_json`|File|Final JSON file of mrdetect results
+`snpcount`|File|number of SNPs in vcf after filtering
+`snvDetectionVAF`|File?|VAF from SNV detection for sample
+`final_call`|File|Final file of mrdetect results
 
 
 ## Commands
- This section lists commands run by the MRDetect workflow
+ This section lists commands run by the MRDetect workflow.
 
- ### detectSNVs
- Performs vcf Filtering, followed by processing of individual `MRDetect` calls. Filters include removing difficult regions (optional), splitting multiallelic loci into one allele per line, removing indels, removing loci by quality metrics (set by `tumorVCFfilter`) and finally removing SNPs by VAF (set by `tumorVAF`). Then `MRDetect` proceed across three steps. This task is run through for the sample and for all the controls.
-
- 1- `pull_reads` takes any reads in the plasma .bam that corresponds to a SNP in the solid-tumour .vcf. 
-
- 2- `quality_score` assesses the likelihood that any read is plasma based on the quality score and the trained pickle. 
-
- 3- `filterAndDetect` keeps reads with high plasma likehood and removed those for which SNPs are in the blocklist. Blocklist is created from HBCs and must be copied into the working directory because the path and file name are hard coded.
-
- 4- optionally, reads from filterAndDetect can be printed to a file called detection_output (and processed by `awk`), using the edited version of the script `filterAndDetect.print.py`
-
-
-
- 		set -euo pipefail
-
- 		tabix -fp vcf ~{tumorvcf}
+ ### filterVCF
+ Performs vcf Filtering, followed by processing of individual `MRDetect` calls. Filters include removing difficult regions (optional), splitting multiallelic loci into one allele per line, removing indels, removing loci by quality metrics (set by `tumorVCFfilter`) and finally removing SNPs by VAF (set by `tumorVAF`).
 
  		$BCFTOOLS_ROOT/bin/bcftools view -s ~{tumorbasename} ~{difficultRegions} ~{tumorvcf} |\
  		$BCFTOOLS_ROOT/bin/bcftools norm --multiallelics - --fasta-ref ~{genome}  |\
  		$BCFTOOLS_ROOT/bin/bcftools filter -i "TYPE='snps'" |\
  		$BCFTOOLS_ROOT/bin/bcftools filter -e "~{tumorVCFfilter}" |\
  		$BCFTOOLS_ROOT/bin/bcftools filter -i "(FORMAT/AD[0:1])/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= ~{tumorVAF}" >~{tumorbasename}.SNP.vcf
 
- 		$MRDETECT_ROOT/bin/pull_reads \
- 			--bam ~{plasmabam} \
- 			--vcf ~{tumorbasename}.SNP.vcf \
- 			--out ~{plasmabasename}_PLASMA_VS_TUMOR
-
- 		$MRDETECT_ROOT/bin/quality_score \
- 			--pickle-name ~{pickle} \
- 			--detections ~{plasmabasename}_PLASMA_VS_TUMOR.tsv \
- 			--output_file ~{plasmabasename}_PLASMA_VS_TUMOR.svm.tsv
-
- 		cp ~{blocklist} ./blacklist.txt.gz
-
- 		~{filterAndDetectScript} \
- 			~{tumorbasename}.SNP.vcf \
- 			~{plasmabasename}_PLASMA_VS_TUMOR.svm.tsv \
- 			~{plasmabasename}_PLASMA_VS_TUMOR_RESULT.csv >detection_output.txt
+ ### parseControls
+ This command processes the list of control files as paired bam/bai files and prints them out for detection.
 
- 		awk '$1 ~ "chr" {print $1"\t"$2"\t"$3"\t"$4}' detection_output.txt | uniq -c >detectionsPerSite.txt
+ ### detectSNVs
+ `MRDetect` proceed across three steps. This task is run through for the sample and for all the controls.
 
+ 1- `pull_reads` takes any reads in the plasma .bam that corresponds to a SNP in the solid-tumour .vcf. 
 
+ 2- `quality_score` assesses the likelihood that any read is plasma based on the quality score and the trained pickle. 
 
- ### parseControls
- This command processes the list of control files as paired bam/bai files and prints them out for detection.
+ 3- `filterAndDetect` keeps reads with high plasma likehood and removed those for which SNPs are in the blocklist. 
 
- 		python 
- 		import os, re
+ 		$MRDETECT_ROOT/bin/pull_reads \
+ 			--bam ~{plasmabam} \
+ 			--vcf ~{tumorvcf} \
+ 			--out PLASMA_VS_TUMOR.tsv
 
- 		with open("~{controlFileListLoc}") as f:
- 			for line in f:
- 				line = line.rstrip()
- 				tmp = line.split("\t")
- 				r = tmp[0] + "\t" + tmp[1]
- 				print(r)
- 		f.close()
+ 		$MRDETECT_ROOT/bin/quality_score \
+ 			--pickle-name ~{pickle} \
+ 			--detections PLASMA_VS_TUMOR.tsv \
+ 			--output_file PLASMA_VS_TUMOR.svm.tsv
+
+ 		$MRDETECT_ROOT/bin/filterAndDetect \
+ 			--vcfid ~{tumorSampleName} --bamid ~{plasmaSampleName} \
+ 			--svm PLASMA_VS_TUMOR.svm.tsv \
+ 			--vcf ~{tumorvcf} \
+ 			--output ./ \
+ 			--blocklist ~{blocklist} \
+ 			--troubleshoot
 
-
  ### snvDetectionSummary
 
  Finally, `pwg_test.R` will process the controls and the sample to make a final call. 	
 
- 		set -euo pipefail
-
  		cat ~{sep=' ' controlCalls} | awk '$1 !~ "BAM" {print}' >~{samplebasename}.HBCs.txt
 
- 		Rscript --vanilla ~{DetectionRScript} -s ~{sampleCalls} -S ~{samplebasename} -c ~{samplebasename}.HBCs.txt
+ 		$MRDETECT_ROOT/bin/pwg_test  \
+ 			--sampleName ~{outputFileNamePrefix} \
+ 			--results ~{outputFileNamePrefix}.HBCs.csv \
+ 			--candidateSNVsCountFile ~{snpcount} \
+ 			--vafFile ~{vafFile} \
+ 			--pval ~{pvalue} 
  ## Support
 
 For support, please file an issue on the [Github project](https://github.com/oicr-gsi) or send an email to [email protected] .

diff --git a/commands.txt b/commands.txt
@@ -1,71 +1,54 @@
 ## Commands
-This section lists commands run by the MRDetect workflow
-
-### detectSNVs
-Performs vcf Filtering, followed by processing of individual `MRDetect` calls. Filters include removing difficult regions (optional), splitting multiallelic loci into one allele per line, removing indels, removing loci by quality metrics (set by `tumorVCFfilter`) and finally removing SNPs by VAF (set by `tumorVAF`). Then `MRDetect` proceed across three steps. This task is run through for the sample and for all the controls.
-
-1- `pull_reads` takes any reads in the plasma .bam that corresponds to a SNP in the solid-tumour .vcf. 
-
-2- `quality_score` assesses the likelihood that any read is plasma based on the quality score and the trained pickle. 
-
-3- `filterAndDetect` keeps reads with high plasma likehood and removed those for which SNPs are in the blocklist. Blocklist is created from HBCs and must be copied into the working directory because the path and file name are hard coded.
-
-4- optionally, reads from filterAndDetect can be printed to a file called detection_output (and processed by `awk`), using the edited version of the script `filterAndDetect.print.py`
+This section lists commands run by the MRDetect workflow.
 
-
-
-		set -euo pipefail
-
-		tabix -fp vcf ~{tumorvcf}
+### filterVCF
+Performs vcf Filtering, followed by processing of individual `MRDetect` calls. Filters include removing difficult regions (optional), splitting multiallelic loci into one allele per line, removing indels, removing loci by quality metrics (set by `tumorVCFfilter`) and finally removing SNPs by VAF (set by `tumorVAF`).
 
 		$BCFTOOLS_ROOT/bin/bcftools view -s ~{tumorbasename} ~{difficultRegions} ~{tumorvcf} |\
 		$BCFTOOLS_ROOT/bin/bcftools norm --multiallelics - --fasta-ref ~{genome}  |\
 		$BCFTOOLS_ROOT/bin/bcftools filter -i "TYPE='snps'" |\
 		$BCFTOOLS_ROOT/bin/bcftools filter -e "~{tumorVCFfilter}" |\
 		$BCFTOOLS_ROOT/bin/bcftools filter -i "(FORMAT/AD[0:1])/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= ~{tumorVAF}" >~{tumorbasename}.SNP.vcf
 
-		$MRDETECT_ROOT/bin/pull_reads \
-			--bam ~{plasmabam} \
-			--vcf ~{tumorbasename}.SNP.vcf \
-			--out ~{plasmabasename}_PLASMA_VS_TUMOR
-
-		$MRDETECT_ROOT/bin/quality_score \
-			--pickle-name ~{pickle} \
-			--detections ~{plasmabasename}_PLASMA_VS_TUMOR.tsv \
-			--output_file ~{plasmabasename}_PLASMA_VS_TUMOR.svm.tsv
-
-		cp ~{blocklist} ./blacklist.txt.gz
+### parseControls
+This command processes the list of control files as paired bam/bai files and prints them out for detection.
 
-		~{filterAndDetectScript} \
-			~{tumorbasename}.SNP.vcf \
-			~{plasmabasename}_PLASMA_VS_TUMOR.svm.tsv \
-			~{plasmabasename}_PLASMA_VS_TUMOR_RESULT.csv >detection_output.txt
+### detectSNVs
+`MRDetect` proceed across three steps. This task is run through for the sample and for all the controls.
 
-		awk '$1 ~ "chr" {print $1"\t"$2"\t"$3"\t"$4}' detection_output.txt | uniq -c >detectionsPerSite.txt
+1- `pull_reads` takes any reads in the plasma .bam that corresponds to a SNP in the solid-tumour .vcf. 
 
+2- `quality_score` assesses the likelihood that any read is plasma based on the quality score and the trained pickle. 
 
+3- `filterAndDetect` keeps reads with high plasma likehood and removed those for which SNPs are in the blocklist. 
 
-### parseControls
-This command processes the list of control files as paired bam/bai files and prints them out for detection.
-
-		python 
-		import os, re
+		$MRDETECT_ROOT/bin/pull_reads \
+			--bam ~{plasmabam} \
+			--vcf ~{tumorvcf} \
+			--out PLASMA_VS_TUMOR.tsv
 
-		with open("~{controlFileListLoc}") as f:
-			for line in f:
-				line = line.rstrip()
-				tmp = line.split("\t")
-				r = tmp[0] + "\t" + tmp[1]
-				print(r)
-		f.close()
+		$MRDETECT_ROOT/bin/quality_score \
+			--pickle-name ~{pickle} \
+			--detections PLASMA_VS_TUMOR.tsv \
+			--output_file PLASMA_VS_TUMOR.svm.tsv
+
+		$MRDETECT_ROOT/bin/filterAndDetect \
+			--vcfid ~{tumorSampleName} --bamid ~{plasmaSampleName} \
+			--svm PLASMA_VS_TUMOR.svm.tsv \
+			--vcf ~{tumorvcf} \
+			--output ./ \
+			--blocklist ~{blocklist} \
+			--troubleshoot
 
-
 ### snvDetectionSummary
 
 Finally, `pwg_test.R` will process the controls and the sample to make a final call. 	
 
-		set -euo pipefail
-
 		cat ~{sep=' ' controlCalls} | awk '$1 !~ "BAM" {print}' >~{samplebasename}.HBCs.txt
 
-		Rscript --vanilla ~{DetectionRScript} -s ~{sampleCalls} -S ~{samplebasename} -c ~{samplebasename}.HBCs.txt
+		$MRDETECT_ROOT/bin/pwg_test  \
+			--sampleName ~{outputFileNamePrefix} \
+			--results ~{outputFileNamePrefix}.HBCs.csv \
+			--candidateSNVsCountFile ~{snpcount} \
+			--vafFile ~{vafFile} \
+			--pval ~{pvalue}