Skip to content
This repository has been archived by the owner on May 13, 2020. It is now read-only.

Dev #2

Open
wants to merge 20 commits into
base: master
Choose a base branch
from
Open

Dev #2

wants to merge 20 commits into from

Conversation

bshifaw
Copy link

@bshifaw bshifaw commented Mar 12, 2020

  • Uploaded GATK 4.1.5.0 version of the gcnv pipeline.
  • Edited the import links to point to git urls.
  • Added input and references in input json files.

Copy link

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for putting this together, @bshifaw! Commented on a few minor typos and some more serious issues (which we've touched upon elsewhere, i.e., 1) how data-specific parameter recommendations should be handled, and 2) the need for a common set of sensible plumbing test resources). Not sure who should make the final call on such issues, but I think Comms needs to provide some guidance. No need to address them for v1 of this workflow, but perhaps we should think about addressing them across all of our workflows sooner rather than later.

Also, do we want to link to the old tutorial? What is the progress on the new one?

README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
"CNVGermlineCaseScatteredWorkflow.gcnv_max_copy_number": 3,
"CNVGermlineCaseScatteredWorkflow.gcnv_max_calling_iters": 1,
"CNVGermlineCaseScatteredWorkflow.maximum_number_events_per_sample": 120,
"CNVGermlineCaseScatteredWorkflow.ref_fasta_dict": "gs://gatk-best-practices/cnv_germline_pipeline/Homo_sapiens_assembly19.truncated.dict",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want to use hg19 test data?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I say onwards and upwards -- to hg38!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to use another set but I'll need help finding resource files (e.g. contig_ploidy_model_tar)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we decide on the appropriate sample BAMs, the resource models used in the case workflow will be generated by running the cohort workflow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cnv_germline_case_scattered_workflow.wdl Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly just assumed the WDLs were the same as the repo, which are already reviewed and tested.

I think there was a copy paste hiccup in the readme responsible for a bunch of the typos.

Some of these may actually be changes to the file in the GATK repo, since a lot of the text is copied. If that's the case, can you make a GATK PR or open an issue?

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
In additional, there are optional workflow-level and task-level parameters that may be set by advanced users; for example:

- ``CNVGermlineCohortWorkflow.do_explicit_gc_correction`` -- (optional) If true, perform explicit GC-bias correction when creating PoN and in subsequent denoising of case samples. If false, rely on PCA-based denoising to correct for GC bias.
- ``CNVGermlineCohortWorkflow.PreprocessIntervals.bin_length`` -- Size of bins (in bp) for coverage collection. *This must be the same value used for all case samples.*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just for WGS?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the tutorial it should be set to 1000 for WGS and 0 for exomes. I noticed the default value for the task that uses the variable (PreprocessIntervals) has a default value for the variable of 1000. Since this workflow is directed towards exomes should the input json have this variable set to 0? @samuelklee

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can technically use it for WES as well (to break targets up into smaller bins), we just haven't evaluated it. Let's leave the documentation unchanged, but set the default to 0 in the JSON---thanks for catching that, @bshifaw!

cnv_common_tasks.wdl Outdated Show resolved Hide resolved
cnv_common_tasks.wdl Outdated Show resolved Hide resolved
cnv_common_tasks.wdl Outdated Show resolved Hide resolved
"CNVGermlineCaseScatteredWorkflow.gcnv_max_copy_number": 3,
"CNVGermlineCaseScatteredWorkflow.gcnv_max_calling_iters": 1,
"CNVGermlineCaseScatteredWorkflow.maximum_number_events_per_sample": 120,
"CNVGermlineCaseScatteredWorkflow.ref_fasta_dict": "gs://gatk-best-practices/cnv_germline_pipeline/Homo_sapiens_assembly19.truncated.dict",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I say onwards and upwards -- to hg38!

cnv_germline_case_scattered_workflow.wdl Outdated Show resolved Hide resolved
Copy link

@asmirnov239 asmirnov239 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you for putting it together

Comment on lines +89 to +91
### LICENSING :
Copyright Broad Institute, 2020 | BSD-3
This script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/openwdl/wdl/blob/master/LICENSE). Note however that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the license information not go in the beginning of each individual WDL?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be, but not sure why its missing for this repo. I'll add it now. As far why this section is here, its a note to inform the user that there might instance where a workflow will use none GATK tools and it's up to them to determine whether they can use it for their use case.

cnv_germline_case_scattered_workflow.inputs.json Outdated Show resolved Hide resolved
cnv_germline_case_workflow.inputs.json Outdated Show resolved Hide resolved
cnv_germline_case_scattered_workflow.wdl Outdated Show resolved Hide resolved
Copy link
Author

@bshifaw bshifaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for putting this together, @bshifaw! Commented on a few minor typos and some more serious issues (which we've touched upon elsewhere, i.e., 1) how data-specific parameter recommendations should be handled, and 2) the need for a common set of sensible plumbing test resources). Not sure who should make the final call on such issues, but I think Comms needs to provide some guidance. No need to address them for v1 of this workflow, but perhaps we should think about addressing them across all of our workflows sooner rather than later.

Also, do we want to link to the old tutorial? What is the progress on the new one?

@samuelklee

  • Added a link to the tutorial near the header.
  • I got through a few of the comments, the ones that still remain is related to using a new input set. There are a few downsampled hg38 exome bam files in the following bucket. I tried using them with the workflow but I ran into some errors, most likely because the parameters were not set right or the correct resource file wasn't available. I can work on this during the next version of the workflow.

@samuelklee
Copy link

Thanks for these changes, @bshifaw! We can stick with hg19 for now. Note that there are many other parameters that are still set to "plumbing" values, other than the one that you changed. However, I am not sure if setting these to default/realistic values (with the assumption that most users will not necessarily know to change them for their actual runs) will work with this plumbing data. If so, I am OK with removing all non-optional parameters from the JSON.

I'm inclined to move towards non-plumbing and more realistic data for showcasing Featured Workspaces, which will hopefully provide a more useful starting point and cost estimate for users. This is ultimately up to Comms, but it should also be considered as part of a larger conversation about testing, as we've discussed elsewhere.

I'll approve, but Mark W. should probably have the final say on this PR before it goes in.

@bshifaw
Copy link
Author

bshifaw commented Mar 20, 2020

Note that there are many other parameters that are still set to "plumbing" values, other than the one that you changed. However, I am not sure if setting these to default/realistic values (with the assumption that most users will not necessarily know to change them for their actual runs) will work with this plumbing data. If so, I am OK with removing all non-optional parameters from the JSON.

I'm inclined to move towards non-plumbing and more realistic data for showcasing Featured Workspaces, which will hopefully provide a more useful starting point and cost estimate for users. This is ultimately up to Comms, but it should also be considered as part of a larger conversation about testing, as we've discussed elsewhere.

I'll approve, but Mark W. should probably have the final say on this PR before it goes in.

@bshifaw bshifaw closed this Mar 20, 2020
@bshifaw bshifaw reopened this Mar 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants