Hi-C workflow #139

Open
wants to merge 124 commits into
base: main

adthrasher
Member

@adthrasher adthrasher commented Mar 22, 2024

Implements a workflow to generate a bowtie2-aligned BAM and a .hic file for analysis. This utilizes the commonly used HiC-Pro workflow.

@adthrasher adthrasher self-assigned this Mar 22, 2024
@adthrasher adthrasher mentioned this pull request Apr 29, 2024
@@ -209,6 +209,10 @@ task bwa_mem {
description: "Read group information for BWA to insert into the header. BWA format: '@RG\tID:foo\tSM:bar'",
group: "common",
}
skip_mate_rescue: "Skip mate rescue"
skip_pairing: "Skip pairing; mate rescue performed unless `skip_mate_rescue` also in use"
split_smallest: "For split alignment, take the alignment with the smallest coordinate as primary"
Member

What's the alternative?

Member Author

Here is the help text.

       -S            skip mate rescue
       -P            skip pairing; mate rescue performed unless -S also in use
       -5            for split alignment, take the alignment with the smallest coordinate as primary

-P is the only option with an entry in the manual:

-P | In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.

Nothing overly helpful
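
For reference, a minimal sketch of how the three new boolean inputs would map onto the flags in the help text above. The function name `bwa_mem_flags` is hypothetical and only illustrates the mapping; it is not the task's actual command block.

```python
def bwa_mem_flags(
    skip_mate_rescue: bool = False,
    skip_pairing: bool = False,
    split_smallest: bool = False,
) -> list[str]:
    """Translate the three boolean inputs into bwa mem command-line flags.

    Hypothetical helper for illustration; the flag letters come from the
    bwa mem help text quoted above.
    """
    flags = []
    if skip_mate_rescue:
        flags.append("-S")  # skip mate rescue
    if skip_pairing:
        flags.append("-P")  # skip pairing; mate rescue still runs unless -S is set
    if split_smallest:
        flags.append("-5")  # smallest-coordinate split alignment becomes primary
    return flags


print(bwa_mem_flags(skip_mate_rescue=True, skip_pairing=True))  # ['-S', '-P']
```

Each input toggles exactly one flag, so the parameter descriptions can mirror the help text one-to-one.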

Member Author

@a-frantz - I've added some additional text, but it relies on Google searches and Q&A threads on online forums. I don't see anything definitive from the bwa authors.

Member

The text for skip_[mate_rescue,pairing] looks good now! But I still have my original Q: What's the alternative to "take the alignment with the smallest coordinate as primary"?

Member Author

I assume it's random, since these have the same score (I think).

tools/bowtie2.wdl (outdated, resolved)
Member

@a-frantz a-frantz left a comment

Another partial review. Still chugging through this 😅

tools/hilow.wdl (outdated, resolved)
tools/hilow.wdl (outdated, resolved)
tools/hilow.wdl (outdated, resolved)
tools/hilow.wdl (outdated, resolved)
tools/hilow.wdl (resolved)
tools/hilow.wdl (outdated, resolved)
@adthrasher adthrasher requested a review from a-frantz January 15, 2025 20:35
Member

@a-frantz a-frantz left a comment

This PR is a doozy. I skimmed some parts and still need to do a closer review of others, but this is enough feedback for now.

Member

I think this version dir should be 2.31.1-0

Member

If this is named after bedtools, why does it have hic scripts in it?

Member Author

Originally, it was just plain bedtools, but with the decision to remove the embedded scripts, those needed to get built into some image, and this one depends on bedtools. It can go somewhere else, but then we'll need to install bedtools into another container.

Member

2.5.4-0

Member

Could we consolidate some of these new Docker images? It looks like we could merge bedtools, hilow, and juicertools.

Member Author

I thought we were trying to avoid these types of monolithic images? We can do that, but is that the direction we want to go?

Member

Hmm. There's a balance to be found, but I'm not sure what it is yet... Responding to #139 (comment) in this thread to consolidate conversation.

Maybe we can merge hilow and juicertools (is my understanding correct that they are usually used together?), building that image with bedtools and moving the hic script which depends on bedtools into there? That's a bit "monolithic", but keeps the bedtools image "clean" and hopefully isn't too sprawling.

Thoughts on that?

args = get_args()

f=open(args.filter_pairs)
blackID=defaultdict(int)
Member

Looks like this was just a copy+paste, but can we use proper casing in this file? And blackID should probably be updated to exclude_list or something
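
Something along these lines, as a rough sketch only. The rest of the original loop isn't shown here, so the file format (one pair ID per line, first whitespace-delimited field) and the stored value of `1` are assumptions, and `load_exclude_list` is a hypothetical name.

```python
from collections import defaultdict


def load_exclude_list(path: str) -> defaultdict:
    """Read the pair IDs to exclude, one per line (assumed format)."""
    exclude_list = defaultdict(int)
    # A context manager closes the file even if parsing fails.
    with open(path) as filter_pairs:
        for line in filter_pairs:
            fields = line.split()
            if fields:  # skip blank lines
                exclude_list[fields[0]] = 1  # assumed marker value
    return exclude_list
```

The call site would then read `exclude_list = load_exclude_list(args.filter_pairs)`, replacing the bare `open()` and the `blackID` name.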

Member

Should we incorporate some Python tooling? A linter of some sort, formatter, etc? Don't really want to ask you to rewrite all these py scripts which appear to be mostly copy+paste, but also they aren't really conformant to any Python standards...

}
}

task fastq_to_sam {
Member

Suggested change
- task fastq_to_sam {
+ task fastq_to_ubam {

???

}

Float fastq_size = size(read_one_fastq_gz, "GiB") + size(read_two_fastq_gz, "GiB")
Int disk_size_gb = ceil(fastq_size * 2) + 10 + modify_disk_size_gb
Member

Is this calculation solid? I'm not really sure what to expect for the size of a uBAM.
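
To make the headroom concrete, here is the same arithmetic as a small Python sketch. It mirrors the WDL expression above; the function name and the `multiplier` parameter are hypothetical, and it does not settle whether 2x is actually enough for a uBAM.

```python
from math import ceil


def disk_size_gb(read_one_gib, read_two_gib, modify_disk_size_gb=0, multiplier=2):
    """Mirror of: ceil(fastq_size * 2) + 10 + modify_disk_size_gb."""
    fastq_size = read_one_gib + read_two_gib
    return ceil(fastq_size * multiplier) + 10 + modify_disk_size_gb


# 6.5 GiB of gzipped FASTQ -> ceil(13.0) + 10 = 23
print(disk_size_gb(3.5, 3.0))  # 23
```

If the localized FASTQs share that disk, the room left for the uBAM is roughly one times the compressed input plus the fixed 10 GB, so the question comes down to how much larger a uBAM is than its gzipped FASTQs.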

@@ -1327,7 +1327,90 @@ task faidx {
cpu: 1
memory: "4 GB"
disks: "~{disk_size_gb} GB"
- container: "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"
+ container: "quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0"
Member

Does this need to be updated for the rest of the file? I thought you added a lint or a CI or something to catch these cases...
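
In case no such check exists yet, a minimal sketch of what one could look like: scan a WDL file's text for `container:` lines and report any image that appears with more than one tag. The function name and regex are assumptions for illustration, not an existing lint in this repo, and image references without an explicit tag are ignored.

```python
import re
from collections import defaultdict

# Matches lines such as: container: "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"
CONTAINER_RE = re.compile(r'container:\s*"([^"]+)"')


def mixed_container_tags(wdl_text: str) -> dict:
    """Map each image to its tags, keeping only images with more than one tag."""
    tags_by_image = defaultdict(set)
    for reference in CONTAINER_RE.findall(wdl_text):
        image, sep, tag = reference.rpartition(":")
        if sep:  # ignore references without an explicit tag
            tags_by_image[image].add(tag)
    return {
        image: sorted(tags)
        for image, tags in tags_by_image.items()
        if len(tags) > 1
    }


example = '''
container: "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"
container: "quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0"
'''
print(mixed_container_tags(example))
```

An empty result means every image in the file is pinned to a single tag; a CI job could fail on any non-empty result.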

external_help: "https://www.htslib.org/doc/samtools-sort.html",
}
prefix: "Prefix for the output file. The extension `.bam` will be added."
uncompressed: "Output uncompressed BAM?"
Member

I think until streaming comes to WDL, we should probably avoid allowing uncompressed output

call bowtie2.build { input:
reference = reference_download.downloaded_file,
prefix = basename(reference_fa_name, ".fa.gz"),
ncpu = 10,
Member

This probably shouldn't be hardcoded here.
