olgabot/sourmash sig merge #117

Merged
merged 52 commits into from
Mar 9, 2021
Changes from 3 commits
df31610
Update extract_per_cell_fastqs to not say __aligned__aligned and retr…
olgabot Oct 28, 2020
700272d
Initial commit for adding sourmash sig merge on aligned/unaligned fro…
olgabot Oct 28, 2020
c4f9521
Update changelog
olgabot Oct 28, 2020
7d3e215
Try to get grouptuple to work
olgabot Oct 28, 2020
533dc94
Set minimum UMI per cell to be a default of 1000
olgabot Oct 29, 2020
da84ba4
Set test min UMI per cell as 5
olgabot Oct 29, 2020
6d30732
Remove unused --shard_size option
olgabot Oct 29, 2020
8fc0c33
Add option for skipping sig merge
olgabot Oct 29, 2020
63273e5
Update Dockerfile
olgabot Oct 29, 2020
054d8b3
Add test for --skip_sig_merge
olgabot Oct 29, 2020
f1304ca
Update changelog
olgabot Jan 5, 2021
a579175
Use more realistic scales and ksizes
olgabot Jan 5, 2021
ef43907
regular test doesn't fail anymore
olgabot Jan 5, 2021
08c32ec
Merge branch 'dev' into olgabot/sourmash-sig-merge
pranathivemuri Jan 5, 2021
d0bec5c
Update bam config
olgabot Jan 6, 2021
000d6ca
Add dump ch_sourmash_sketches_mixed
olgabot Jan 6, 2021
6bce013
Update schema
olgabot Jan 6, 2021
04e62d4
Merge remote-tracking branch 'origin' into olgabot/sourmash-sig-merge
olgabot Jan 6, 2021
ba765ff
Add params.ksizes to sketch output
olgabot Jan 7, 2021
ed5e72b
Add peptide_molecules
olgabot Jan 7, 2021
1718502
add check for skip_compute in sig merge logic
olgabot Jan 7, 2021
0029b51
Add header
olgabot Jan 7, 2021
c04a5a1
Only mix sketches if not skip_compute
olgabot Jan 7, 2021
5893884
param --> params
olgabot Jan 7, 2021
4118c6c
Add some projectdir stuff
olgabot Jan 9, 2021
baa96f8
More projectDir fixes
olgabot Jan 11, 2021
8ba5db1
Do per-ksize sourmash sig merge
olgabot Jan 11, 2021
c27b2b4
Add sourmash describe csvs to multiqc
olgabot Jan 11, 2021
05f6702
Update ProjectDir
olgabot Jan 11, 2021
2fe4523
Properly save translate output
olgabot Jan 11, 2021
27d95f0
Add dump of sourmash sketches
olgabot Jan 11, 2021
e68db00
Fixing sourmash sig merge
olgabot Jan 11, 2021
7195eae
Add ch_sourmash_sig_describe_nucleotides
olgabot Jan 11, 2021
9d090b4
more if/else
olgabot Jan 11, 2021
5bea4c9
Update changelog
olgabot Jan 16, 2021
9682b03
Getting "sig merge" to finally run
olgabot Jan 16, 2021
76b2ed7
Add option to skip sig merge
olgabot Jan 16, 2021
3c95f82
Update validate_sketch_value to only allow a single value
olgabot Jan 16, 2021
1321968
Change sketch values to single value
olgabot Jan 16, 2021
dc0ecdd
peptide_molecule --> translate_peptide_molecule
olgabot Jan 17, 2021
1fe32b1
add "translate_" to peptide ksize and jaccard threshold
olgabot Jan 18, 2021
1297886
Do sig merge on individual moltypes
olgabot Mar 9, 2021
d8e764b
Add test_sig_merge
olgabot Mar 9, 2021
232ac0d
Add test_sig_merge to CI
olgabot Mar 9, 2021
253fa83
Don't allow multiple sketch values
olgabot Mar 9, 2021
67aec5e
Reduce bloom filter table size
olgabot Mar 9, 2021
8c99f9f
Sig merge is working!
olgabot Mar 9, 2021
fccec83
Make test params more realistic
olgabot Mar 9, 2021
8de30e3
Update default ksizes, add track abundance true
olgabot Mar 9, 2021
0aade17
Update variables in merge_renamed_sigs.pyh
olgabot Mar 9, 2021
ad36259
Get sourmash compare to happen on correct ksizes and moltypes
olgabot Mar 9, 2021
43bbafd
Merge branch 'dev' into olgabot/sourmash-sig-merge
olgabot Mar 9, 2021
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -20,6 +20,7 @@ Initial release of nf-core/kmermaid, created with the [nf-core](http://nf-co.re/
barcode fastq
* Add version printing for sencha, bam2fasta, and sourmash in Dockerfile, update versions in environment.yml
* For processes translate, sourmash compute add cpus=1 as they are only serial ([#107](https://github.com/nf-core/kmermaid/pull/107))
+ * Add `sourmash sig merge` for aligned/unaligned signatures
Contributor

could you write a detailed description? also sourmash_sig_merge


### `Fixed`

74 changes: 64 additions & 10 deletions main.nf
@@ -894,9 +894,9 @@ if (params.tenx_tgz || params.bam) {
.set{ tenx_reads_with_good_barcodes_ch }

process extract_per_cell_fastqs {
- tag "${is_aligned_channel_id}__${cell_barcode}"
+ tag "${fastq_id}"
label "low_memory"
- errorStrategy 'ignore'
+ errorStrategy { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'ignore' }
Contributor

this is cool!
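
For readers unfamiliar with the pattern: the closure on the new `errorStrategy` line retries a task only when its exit status is in a fixed list and ignores every other failure. Exit statuses above 128 conventionally mean the process was killed by signal `status - 128`, so 137 and 143 usually point to a kill by the scheduler or a resource limit. The Python sketch below only models that decision outside Nextflow. The function names are hypothetical and the signal lookup assumes a POSIX system.

```python
import signal

# Exit statuses copied from the errorStrategy closure in the diff.
RETRY_EXIT_STATUSES = {143, 137, 104, 134, 139}


def error_strategy(exit_status):
    """Model of the closure: retry the listed statuses, ignore anything else."""
    return "retry" if exit_status in RETRY_EXIT_STATUSES else "ignore"


def killing_signal(exit_status):
    """Return the signal name for a 128 + N exit status, or None if not applicable."""
    if exit_status <= 128:
        return None
    try:
        return signal.Signals(exit_status - 128).name
    except ValueError:
        # 128 + N where N is not a valid signal number on this platform
        return None


for status in sorted(RETRY_EXIT_STATUSES):
    print(status, error_strategy(status), killing_signal(status))
```

Status 104 is below 128, so the sketch reports no signal for it. The diff does not show whether a retry limit is set elsewhere in the pipeline configuration.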

publishDir "${params.outdir}/10x-fastqs/per-cell/${channel_id}/", mode: 'copy', pattern: '*.fastq.gz', saveAs: { filename -> "${filename.replace("|", "-")}"}

input:
@@ -909,10 +909,8 @@ if (params.tenx_tgz || params.bam) {
set val(fastq_id), val(cell_id), val(is_aligned) into ch_fastq_id_to_cell_id_is_aligned

script:
- is_aligned_channel_id = "${channel_id}__${is_aligned}"
processes = "--processes ${task.cpus}"
this_cell_barcode = tenx_cell_barcode_pattern.replace('([ACGT]+)', cell_barcode)
- fastq_id = "${is_aligned_channel_id}__${is_aligned}__${cell_barcode}"
+ fastq_id = "${channel_id}__${is_aligned}__${cell_barcode}"
cell_id = "${channel_id}__${cell_barcode}"
this_cell_fastq_gz = "${fastq_id}.fastq.gz"
"""
@@ -1276,7 +1274,7 @@ if (!params.remove_ribo_rna) {

output:
file(csv) into ch_sourmash_sig_describe_nucleotides
- set val(sketch_id), val("dna"), val(ksize), val(sketch_value), file(sig) into sourmash_sketches_all_nucleotide
+ set val(sample_id), val("dna"), val(ksize), file(sig) into sourmash_sketches_all_nucleotide

script:
// Don't calculate DNA signature if this is protein, to minimize disk,
@@ -1301,7 +1299,7 @@ if (!params.remove_ribo_rna) {
sourmash sig describe --csv ${csv} ${sig}
"""
}
- sourmash_sketches_nucleotide = sourmash_sketches_all_nucleotide.filter{ it[4].size() > 0 }
+ sourmash_sketches_nucleotide = sourmash_sketches_all_nucleotide.filter{ it[3].size() > 0 }
}
} else {
sourmash_sketches_nucleotide = Channel.empty()
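
Dropping `val(sketch_value)` from the output tuple moves the signature file from index 4 to index 3, which is why the filter index changes in the same commit. The filter keeps only tuples whose signature file is non-empty. A Python sketch of the same idea follows, with made-up file names and `os.path.getsize` standing in for the file `size()` call used in the pipeline:

```python
import os
import tempfile


def keep_nonempty(sketches, file_index):
    """Keep tuples whose file at `file_index` has a size greater than zero."""
    return [t for t in sketches if os.path.getsize(t[file_index]) > 0]


with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "cell1.sig")   # hypothetical non-empty signature
    empty = os.path.join(tmp, "cell2.sig")  # hypothetical empty signature
    with open(full, "w") as fh:
        fh.write('[{"placeholder": true}]')
    open(empty, "w").close()

    # New 4-element layout: (sample_id, molecule, ksize, sig)
    sketches = [("cell1", "dna", 21, full), ("cell2", "dna", 21, empty)]
    kept = keep_nonempty(sketches, file_index=3)
    kept_ids = [t[0] for t in kept]

print(kept_ids)
```

With the old 5-element layout the same call would need `file_index=4`. Using the wrong index would test the size of a value that is not the file.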
@@ -1344,7 +1342,7 @@ if (!params.skip_compute && (protein_input || params.reference_proteome_fasta)){

output:
file(csv) into ch_sourmash_sig_describe_peptides
- set val(sketch_id), val(molecule), val(ksize), val(sketch_value), file(sig) into sourmash_sketches_all_peptide
+ set val(sample_id), val(molecule), val(ksize), file(sig) into sourmash_sketches_all_peptide

script:
sketch_id = make_sketch_id(molecule, ksize, sketch_value, track_abundance, sketch_style)
@@ -1369,11 +1367,68 @@ if (!params.skip_compute && (protein_input || params.reference_proteome_fasta)){
sourmash sig describe --csv ${csv} ${sig}
"""
}
- sourmash_sketches_peptide = sourmash_sketches_all_peptide.filter{ it[4].size() > 0 }
+ sourmash_sketches_peptide = sourmash_sketches_all_peptide.filter{ it[3].size() > 0 }
} else {
sourmash_sketches_peptide = Channel.empty()
}

+ if (params.bam || params.tenx_tgz) {
+ // Merge signatures from same sample id and sketch id
+
+ sourmash_sketches_nucleotide
+ .mix ( sourmash_sketches_peptide )
+ .set { ch_sourmash_sketches_mixed}
+
+ ch_fastq_id_to_cell_id_is_aligned
+ .combine ( ch_sourmash_sketches_mixed )
+ .dump( tag: 'fastq_id_to_cells__join__sketches' )
+ .groupTuple( by: 1 )
+ .dump( tag: 'fastq_id_to_cells__join__sketches__grouptuple' )
+ .set { ch_sourmash_sketches_to_merge }
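
The channel chain above can be hard to picture. Per the Nextflow operator documentation, `combine` without a `by` option emits the Cartesian product of the two channels, and `groupTuple( by: 1 )` then collects every tuple sharing the element at index 1 (here `cell_id`) into one tuple of lists. The Python model below mimics that behavior on made-up values. `combine` and `group_tuple` are hypothetical helper names, not Nextflow APIs, and the PR does not say whether the cross product is intended.

```python
from collections import defaultdict
from itertools import product


def combine(left, right):
    """Cartesian product of two lists of tuples, concatenating each pair."""
    return [l + r for l, r in product(left, right)]


def group_tuple(items, by):
    """Group tuples on the element at index `by`; other positions become lists."""
    groups = defaultdict(list)
    for item in items:
        groups[item[by]].append(item)
    grouped = []
    for key, members in groups.items():
        columns = [list(col) for col in zip(*members)]
        columns[by] = key  # the grouping key stays a single value
        grouped.append(tuple(columns))
    return grouped


# Hypothetical values: (fastq_id, cell_id, is_aligned)
cells = [
    ("s1__aligned__AAA", "s1__AAA", "aligned"),
    ("s1__unaligned__AAA", "s1__AAA", "unaligned"),
]
# Hypothetical values: (sample_id, molecule, ksize, sig)
sketches = [
    ("s1__aligned__AAA", "dna", 21, "aligned.sig"),
    ("s1__unaligned__AAA", "dna", 21, "unaligned.sig"),
]

combined = combine(cells, sketches)
grouped = group_tuple(combined, by=1)
print(len(combined), len(grouped))
print(grouped[0][6])
```

In this model each signature appears twice in the grouped tuple, because every `fastq_id` is paired with every sketch rather than only with the sketch of the same id. The grouped tuple also has seven positions, while the `input:` declaration of `sourmash_sig_merge` at this commit lists six.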

+ process sourmash_sig_merge {
+ tag "${sig_id}"
+ label "low_memory"
+ publishDir "${params.outdir}/sketches_merged/${sketch_id}", mode: "${params.publish_dir_mode}",
+ saveAs: {filename ->
+ if (filename.indexOf(".csv") > 0) "description/$filename"
+ else if (filename.indexOf(".sig") > 0) "sigs/$filename"
+ else null
+ }
+
+ input:
+ set val(molecule), val(ksize), val(sketch_style), val(sketch_value), val(sample_id), file(reads) from ch_sourmash_sketches_to_merge
+
+ output:
+ file(csv) into ch_sourmash_sig_describe_merged
+ set val(sketch_id), val(molecule), val(ksize), val(sketch_value), file(sig) into sourmash_sketches
+
+ script:
+ // sketch_id = make_sketch_id(molecule, ksize, sketch_value, track_abundance, sketch_style)
+ sketch_value_flag = make_sketch_value_flag(sketch_style, sketch_value)
+ track_abundance_flag = track_abundance ? '--track-abundance' : ''
+ processes = "--processes ${task.cpus}"
+ sig_id = "${sample_id}__${sketch_id}"
+ sig = "${sig_id}.sig"
+ csv = "${sig_id}.csv"
+ """
+ sourmash compute \\
Contributor

so we are doing sourmash compute twice? is this also dependent on skip_compute flag?
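
On the reviewer's question: the process is named `sourmash_sig_merge`, but the script shown at this commit runs `sourmash compute` on `$reads`. Conceptually, merging compatible signatures works on the sketches themselves and does not need the reads. The following is a rough Python model only. It assumes scaled sketches with identical ksize, molecule type, and scaled value, and it assumes abundances add on merge. It is a simplification for illustration, not sourmash's implementation or API, and the hash values are invented.

```python
from collections import Counter


def merge_scaled_sketches(*sketches):
    """Merge sketches given as {hash_value: abundance} dicts.

    Simplified model: the merged sketch holds the union of all hashes,
    and abundances of shared hashes are summed.
    """
    merged = Counter()
    for sketch in sketches:
        merged.update(sketch)  # Counter.update adds counts for existing keys
    return dict(merged)


# Hypothetical sketches for the aligned and unaligned reads of one cell.
aligned = {1001: 2, 5005: 1}
unaligned = {5005: 3, 9009: 1}

merged = merge_scaled_sketches(aligned, unaligned)
print(merged)
```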

+ ${sketch_value_flag} \\
+ --ksizes $ksize \\
+ --input-is-protein \\
+ --$molecule \\
+ --name '${sample_id}' \\
+ --no-dna \\
+ $processes \\
+ $track_abundance_flag \\
+ --output ${sig} \\
+ $reads
+ sourmash sig describe --csv ${csv} ${sig}
+ """
+ }
+
+ }

if (params.split_kmer){
process ska_compare_sketches {
tag "${sketch_id}"
@@ -1397,7 +1452,6 @@ if (params.split_kmer){
if (!params.split_kmer && !params.skip_compare && !params.skip_compute) {
process sourmash_compare_sketches {
// Combine peptide and nucleotide sketches
- sourmash_sketches = sourmash_sketches_peptide.concat(sourmash_sketches_nucleotide)
tag "${sketch_id}"
publishDir "${params.outdir}/compare_sketches", mode: 'copy'
