Skip to content

Commit

Permalink
Several updates for making Malaria Joint Calling easier. (#449)
Browse files Browse the repository at this point in the history
* Reintroduced `ReblockGVCF` step.

- Updated HaplotypeCaller::ReblockGVCF to point to latest Docker image
  with fix for ReblockGVCF annotations.
- Updated HaplotypeCaller::ReblockGVCF to use 2 cores.
- Disabled DeepVariant/Pepper calling in `SRWholeGenome` by default.
- Removed `SRJointGenotyping::ReblockGVCF` - it should only be defined
  in one place.

* Removed hard filtered output file.

* Updating all GATK 4.3 tasks to GATK 4.5

* Disabled QC when running on a singe bam file input.

* Added a runtime_attr override for HaplotypeCaller subworkflow.

* Update Utils.wdl

Modified the disks from " LOCAL" to " SSD"

* Updating tasks to use `SSD` rather than `LOCAL` disk.

* Moved HaplotypeCaller and ReblockGVCF to SSD from HDD

* Fixed a bug in `compute_sr_stats.py` that allowed `nan`s

* Updated `sr-utils` docker to use `mamba` conda env solver.

* Updates to `sr-utils` docker image.

- Updated conda solver in `sr-utils` to `mamba`.
- Fixed minor deprecation warning in `compute_sr_stats.py`
- Fixed issue in `compute_sr_stats.py` that caused certain inputs to
fail due to missing base qualities or `nan` values.
- Updated version of `sr-utils` docker image to `0.2.2`.

* Updated `sr-utils` docker image to version `0.2.2`

* Added note for updating nightly GATK build.

---------

Co-authored-by: Shadi Zaheri <74751641+shadizaheri@users.noreply.github.com>
jonn-smith and shadizaheri authored Apr 24, 2024
1 parent c551987 commit 197d5b3
Showing 17 changed files with 159 additions and 210 deletions.
6 changes: 6 additions & 0 deletions docker/sr-utils/Dockerfile
Original file line number Diff line number Diff line change
@@ -5,8 +5,14 @@ MAINTAINER Kiran V Garimella
# copy other resources
COPY ./environment.yml /

# Set new conda solver so we don't have to wait forever:
RUN conda update -n base conda
RUN conda install -n base conda-libmamba-solver
RUN conda config --set solver libmamba

# install conda packages
RUN conda env create -f /environment.yml && conda clean -a
RUN echo "source activate sr-utils" > ~/.bashrc
ENV PATH=/opt/conda/envs/sr-utils/bin/:/root/google-cloud-sdk/bin/:${PATH}

# Install BWA-MEM2:
2 changes: 1 addition & 1 deletion docker/sr-utils/Makefile
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
IMAGE_NAME = sr-utils
VERSION = 0.2.1
VERSION = 0.2.2

TAG1 = us.gcr.io/broad-dsp-lrma/$(IMAGE_NAME):$(VERSION)
TAG2 = us.gcr.io/broad-dsp-lrma/$(IMAGE_NAME):latest
1 change: 0 additions & 1 deletion docker/sr-utils/environment.yml
Original file line number Diff line number Diff line change
@@ -9,4 +9,3 @@ dependencies:
- pysam
- numpy
- tqdm

20 changes: 14 additions & 6 deletions docker/sr-utils/python/compute_sr_stats.py
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ def n50(lengths):
csum = np.cumsum(all_len)
n2 = int(sum(lengths) / 2)
csumn2 = min(csum[csum >= n2])
ind = np.where(csum == csumn2)
ind = np.where(csum == csumn2)[0]

return all_len[int(ind[0])]

@@ -36,7 +36,10 @@ def get_bam_stats(bam_file_path, qual_thresh=None):
for read in tqdm(bam_file, desc=f"Collecting Bam Stats" + (f" (rq >= {qual_thresh})" if qual_thresh else ""),
total=total_reads, unit=" read"):
l = len(read.query_sequence)
q = np.mean(read.query_qualities)
if read.query_qualities is not None:
q = np.mean(read.query_qualities)
else:
q = 0

if qual_thresh and q < qual_thresh:
continue
@@ -51,6 +54,11 @@ def get_bam_stats(bam_file_path, qual_thresh=None):
return n_reads, total_bases, np.mean(quals), np.median(quals), np.array(read_lengths)


def nan_zero_wrap(float_val: float) -> float:
"""Return 0 if the value is NaN, otherwise return the value."""
return float_val if not np.isnan(float_val) else 0


def main():
parser = argparse.ArgumentParser(description='Compute short read bam file stats', prog='compute_sr_stats')
parser.add_argument('-q', '--qual-threshold', type=int, default=0, help="Phred-scale quality threshold")
@@ -59,10 +67,10 @@ def main():

n_reads, n_bases, mean_qual, median_qual, read_lengths = get_bam_stats(args.bam_file_path, args.qual_threshold)

print(f"reads\t{n_reads}")
print(f"bases\t{n_bases}")
print(f"mean_qual\t{mean_qual}")
print(f"median_qual\t{median_qual}")
print(f"reads\t{nan_zero_wrap(n_reads)}")
print(f"bases\t{nan_zero_wrap(n_bases)}")
print(f"mean_qual\t{nan_zero_wrap(mean_qual)}")
print(f"median_qual\t{nan_zero_wrap(median_qual)}")

print(f"read_mean\t{int(np.mean(read_lengths)) if len(read_lengths) > 0 else 0}")
print(f"read_median\t{int(np.median(read_lengths)) if len(read_lengths) > 0 else 0}")
103 changes: 52 additions & 51 deletions wdl/pipelines/ILMN/VariantCalling/SRWholeGenome.wdl
Original file line number Diff line number Diff line change
@@ -138,39 +138,51 @@ workflow SRWholeGenome {
File bam = select_first([MergeAllReads.merged_bam, aligned_bams[0]])
File bai = select_first([MergeAllReads.merged_bai, aligned_bais[0]])

# Collect sample-level metrics:
call AM.SamStatsMap as SamStats { input: bam = bam }
call FastQC.FastQC as FastQC { input: bam = bam, bai = bai }
call Utils.ComputeGenomeLength as ComputeGenomeLength { input: fasta = ref_map['fasta'] }
call SRUTIL.ComputeBamStats as ComputeBamStats { input: bam_file = bam }

if (defined(bed_to_compute_coverage)) {
call AM.MosDepthOverBed as MosDepth {
input:
bam = bam,
bai = bai,
bed = select_first([bed_to_compute_coverage])
}
# Only collect metrics if we have multiple input bam files
if (length(aligned_bams) > 1) {
# Collect sample-level metrics:
call AM.SamStatsMap as SamStats { input: bam = bam }
call FastQC.FastQC as FastQC { input: bam = bam, bai = bai }
call Utils.ComputeGenomeLength as ComputeGenomeLength { input: fasta = ref_map['fasta'] }
call SRUTIL.ComputeBamStats as ComputeBamStats { input: bam_file = bam }

if (defined(bed_to_compute_coverage)) {
call AM.MosDepthOverBed as MosDepth {
input:
bam = bam,
bai = bai,
bed = select_first([bed_to_compute_coverage])
}

call COV.SummarizeDepthOverWholeBed as RegionalCoverage {
input:
mosdepth_output = MosDepth.regions
call COV.SummarizeDepthOverWholeBed as RegionalCoverage {
input:
mosdepth_output = MosDepth.regions
}
}
}

call FF.FinalizeToFile as FinalizeBam { input: outdir = bam_dir, file = bam, name = "~{participant_name}.bam" }
call FF.FinalizeToFile as FinalizeBai { input: outdir = bam_dir, file = bai, name = "~{participant_name}.bam.bai" }
call FF.FinalizeToFile as FinalizeBam { input: outdir = bam_dir, file = bam, name = "~{participant_name}.bam" }
call FF.FinalizeToFile as FinalizeBai { input: outdir = bam_dir, file = bai, name = "~{participant_name}.bam.bai" }

if (defined(bed_to_compute_coverage)) { call FF.FinalizeToFile as FinalizeRegionalCoverage { input: outdir = bam_dir, file = select_first([RegionalCoverage.cov_summary]) } }
if (defined(bed_to_compute_coverage)) { call FF.FinalizeToFile as FinalizeRegionalCoverage { input: outdir = bam_dir, file = select_first([RegionalCoverage.cov_summary]) } }

call FF.FinalizeToFile as FinalizeFastQCReport {
input:
outdir = metrics_dir,
file = FastQC.report
}

call FF.FinalizeToFile as FinalizeFastQCReport {
input:
outdir = metrics_dir,
file = FastQC.report
# Calculate some final metrics that we need temporary variables for:
Float tmp_average_identity = 100.0 - (100.0*SamStats.stats_map['mismatches']/SamStats.stats_map['bases_mapped'])
Float tmp_aligned_frac_bases = SamStats.stats_map['bases_mapped']/SamStats.stats_map['total_length']
Float tmp_aligned_est_fold_cov = SamStats.stats_map['bases_mapped']/ComputeGenomeLength.length
Float tmp_aligned_num_reads = FastQC.stats_map['number_of_reads']
Float tmp_aligned_num_bases = SamStats.stats_map['bases_mapped']
Float tmp_aligned_read_length_mean = FastQC.stats_map['read_length']
Float tmp_insert_size_average = SamStats.stats_map['insert_size_average']
Float tmp_insert_size_standard_deviation = SamStats.stats_map['insert_size_standard_deviation']
Float tmp_pct_properly_paired_reads = SamStats.stats_map['percentage_of_properly_paired_reads_%']
}


####################################################################################################
# Some input handling:
@@ -385,15 +397,8 @@ workflow SRWholeGenome {
}
}

call VARUTIL.SelectVariants as RemoveFilteredVariants {
input:
vcf = ScoreIndelVariantAnnotations.scored_vcf,
vcf_index = ScoreIndelVariantAnnotations.scored_vcf_index,
prefix = participant_name + ".filtered"
}

# Create a Keyfile for finalization:
File keyfile = RemoveFilteredVariants.vcf_out_index
File keyfile = select_first([FingerprintAndBarcodeVcf.barcode_file, ScoreIndelVariantAnnotations.scored_vcf_index])

# Finalize the raw Joint Calls:
call FF.FinalizeToFile as FinalizeHCVcf { input: outdir = smalldir, keyfile = keyfile, file = RenameRawHcVcf.new_sample_name_vcf }
@@ -406,8 +411,6 @@ workflow SRWholeGenome {
# Finalize the reclibrated / filtered variants:
call FF.FinalizeToFile as FinalizeHCRescoredVcf { input: outdir = smalldir, keyfile = keyfile, file = ScoreIndelVariantAnnotations.scored_vcf }
call FF.FinalizeToFile as FinalizeHCRescoredTbi { input: outdir = smalldir, keyfile = keyfile, file = ScoreIndelVariantAnnotations.scored_vcf_index }
call FF.FinalizeToFile as FinalizeHCRescoredFilteredVcf { input: outdir = smalldir, keyfile = keyfile, file = RemoveFilteredVariants.vcf_out }
call FF.FinalizeToFile as FinalizeHCRescoredFilteredTbi { input: outdir = smalldir, keyfile = keyfile, file = RemoveFilteredVariants.vcf_out_index }

# Finalize other outputs:
if (defined(fingerprint_haploytpe_db_file)) {
@@ -478,31 +481,31 @@ workflow SRWholeGenome {
}

output {
File aligned_bam = FinalizeBam.gcs_path
File aligned_bai = FinalizeBai.gcs_path

Float aligned_num_reads = FastQC.stats_map['number_of_reads']
Float aligned_num_bases = SamStats.stats_map['bases_mapped']
Float aligned_frac_bases = SamStats.stats_map['bases_mapped']/SamStats.stats_map['total_length']
Float aligned_est_fold_cov = SamStats.stats_map['bases_mapped']/ComputeGenomeLength.length
File? aligned_bam = FinalizeBam.gcs_path
File? aligned_bai = FinalizeBai.gcs_path

Float aligned_read_length_mean = FastQC.stats_map['read_length']
Float? aligned_num_reads = tmp_aligned_num_reads
Float? aligned_num_bases = tmp_aligned_num_bases
Float? aligned_frac_bases = tmp_aligned_frac_bases
Float? aligned_est_fold_cov = tmp_aligned_est_fold_cov

Float insert_size_average = SamStats.stats_map['insert_size_average']
Float insert_size_standard_deviation = SamStats.stats_map['insert_size_standard_deviation']
Float pct_properly_paired_reads = SamStats.stats_map['percentage_of_properly_paired_reads_%']
Float? aligned_read_length_mean = tmp_aligned_read_length_mean

Float average_identity = 100.0 - (100.0*SamStats.stats_map['mismatches']/SamStats.stats_map['bases_mapped'])
Float? insert_size_average = tmp_insert_size_average
Float? insert_size_standard_deviation = tmp_insert_size_standard_deviation
Float? pct_properly_paired_reads = tmp_pct_properly_paired_reads

File fastqc_report = FinalizeFastQCReport.gcs_path
Float? average_identity = tmp_average_identity

Boolean successfully_processed = true
File? fastqc_report = FinalizeFastQCReport.gcs_path

File? bed_cov_summary = FinalizeRegionalCoverage.gcs_path

File? fingerprint_vcf = FinalizeFingerprintVcf.gcs_path
String? barcode = FingerprintAndBarcodeVcf.barcode

Boolean successfully_processed = true

########################################
File? dvp_vcf = FinalizeDVPepperVcf.gcs_path
@@ -520,7 +523,5 @@ workflow SRWholeGenome {
File? hc_raw_tbi = FinalizeHCTbi.gcs_path
File? hc_rescored_vcf = FinalizeHCRescoredVcf.gcs_path
File? hc_rescored_tbi = FinalizeHCRescoredTbi.gcs_path
File? hc_rescored_filtered_vcf = FinalizeHCRescoredFilteredVcf.gcs_path
File? hc_rescored_filtered_tbi = FinalizeHCRescoredFilteredTbi.gcs_path
}
}
4 changes: 2 additions & 2 deletions wdl/tasks/Preprocessing/CollectParentsKmerStats.wdl
Original file line number Diff line number Diff line change
@@ -219,7 +219,7 @@ task ParentalReadsRepartitionAndMerylConfigure {
runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " LOCAL" # LOCAL because this task is mostly IO operation
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " SSD" # If this is too slow, revert to LOCAL (LOCAL because this task is mostly IO operation)
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
@@ -484,7 +484,7 @@ task MerylMergeAndSubtract {
runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " LOCAL"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " SSD" # If this is too slow, revert to LOCAL
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
8 changes: 5 additions & 3 deletions wdl/tasks/QC/Fingerprinting.wdl
Original file line number Diff line number Diff line change
@@ -222,13 +222,15 @@ task ExtractRelevantGenotypingReads {
RuntimeAttr? runtime_attr_override
}
Int disk_size = 50 + 2*ceil(size(aligned_bam, "GB")) + 2*ceil(size(genotyping_sites_bed, "GB")) + 2*ceil(size(aligned_bai, "GB"))
command <<<
set -eux
export GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token`
samtools view -h -@ 1 \
samtools view -h -@ 2 \
--write-index \
-o "relevant_reads.bam##idx##relevant_reads.bam.bai" \
-M -L ~{genotyping_sites_bed} \
@@ -244,7 +246,7 @@ task ExtractRelevantGenotypingReads {
RuntimeAttr default_attr = object {
cpu_cores: 4,
mem_gb: 8,
disk_gb: 375, # will use LOCAL SSD for speeding things up
disk_gb: disk_size,
boot_disk_gb: 10,
preemptible_tries: 0,
max_retries: 1,
@@ -254,7 +256,7 @@ task ExtractRelevantGenotypingReads {
runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " LOCAL"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " SSD" # if SSD is too slow, revert to LOCAL
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
5 changes: 3 additions & 2 deletions wdl/tasks/Utility/PBUtils.wdl
Original file line number Diff line number Diff line change
@@ -79,6 +79,7 @@ task ShardLongReads {
# when running large scale workflows, we sometimes see errors like the following
# A resource limit has delayed the operation: generic::resource_exhausted: allocating: selecting resources: selecting region and zone:
# no available zones: 2763 LOCAL_SSD_TOTAL_GB (738/30000 available) usage too high
# NOTE: Changed disk type to SSD to prevent the above issue -JTS
zones: "select which zone (GCP) to run this task"
num_ssds: "number of SSDs to use"
runtime_attr_override: "Override default runtime attributes."
@@ -136,7 +137,7 @@ task ShardLongReads {
runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " LOCAL"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " SSD" # If SSD is too slow, revert to LOCAL
zones: zones
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
@@ -303,7 +304,7 @@ task ExtractHifiReads {
runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " LOCAL"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " SSD" # If SSD is too slow, revert to LOCAL
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
12 changes: 6 additions & 6 deletions wdl/tasks/Utility/SRUtils.wdl
Original file line number Diff line number Diff line change
@@ -219,7 +219,7 @@ task BwaMem2 {
boot_disk_gb: 10,
preemptible_tries: 1,
max_retries: 1,
docker: "us.gcr.io/broad-dsp-lrma/sr-utils:0.2.1"
docker: "us.gcr.io/broad-dsp-lrma/sr-utils:0.2.2"
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
runtime {
@@ -438,7 +438,7 @@ task BaseRecalibrator {
boot_disk_gb: 10,
preemptible_tries: 1,
max_retries: 1,
docker: "us.gcr.io/broad-gatk/gatk:4.3.0.0"
docker: "us.gcr.io/broad-gatk/gatk:4.5.0.0"
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
runtime {
@@ -521,7 +521,7 @@ task ApplyBQSR {
boot_disk_gb: 10,
preemptible_tries: 1,
max_retries: 1,
docker: "us.gcr.io/broad-gatk/gatk:4.3.0.0"
docker: "us.gcr.io/broad-gatk/gatk:4.5.0.0"
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
runtime {
@@ -637,7 +637,7 @@ task ComputeBamStats {
boot_disk_gb: 10,
preemptible_tries: 1,
max_retries: 2,
docker: "us.gcr.io/broad-dsp-lrma/sr-utils:0.2.1"
docker: "us.gcr.io/broad-dsp-lrma/sr-utils:0.2.2"
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
runtime {
@@ -735,7 +735,7 @@ task IndexFeatureFile {
boot_disk_gb: 10,
preemptible_tries: 1,
max_retries: 1,
docker: "us.gcr.io/broad-gatk/gatk:4.3.0.0"
docker: "us.gcr.io/broad-gatk/gatk:4.5.0.0"
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
runtime {
@@ -807,7 +807,7 @@ task RevertBaseQualities {
boot_disk_gb: 10,
preemptible_tries: 1,
max_retries: 1,
docker: "us.gcr.io/broad-gatk/gatk:4.3.0.0"
docker: "us.gcr.io/broad-gatk/gatk:4.5.0.0"
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
runtime {
Loading

0 comments on commit 197d5b3

Please sign in to comment.