olgabot/sourmash sig merge #117

Merged
merged 52 commits into from
Mar 9, 2021
Commits
df31610
Update extract_per_cell_fastqs to not say __aligned__aligned and retr…
olgabot Oct 28, 2020
700272d
Initial commit for adding sourmash sig merge on aligned/unaligned fro…
olgabot Oct 28, 2020
c4f9521
Update changelog
olgabot Oct 28, 2020
7d3e215
Try to get grouptuple to work
olgabot Oct 28, 2020
533dc94
Set minimum UMI per cell to be a default of 1000
olgabot Oct 29, 2020
da84ba4
Set test min UMI per cell as 5
olgabot Oct 29, 2020
6d30732
Remove unused --shard_size option
olgabot Oct 29, 2020
8fc0c33
Add option for skipping sig merge
olgabot Oct 29, 2020
63273e5
Update Dockerfile
olgabot Oct 29, 2020
054d8b3
Add test for --skip_sig_merge
olgabot Oct 29, 2020
f1304ca
Update changelog
olgabot Jan 5, 2021
a579175
Use more realistic scales and ksizes
olgabot Jan 5, 2021
ef43907
regular test doesn't fail anymore
olgabot Jan 5, 2021
08c32ec
Merge branch 'dev' into olgabot/sourmash-sig-merge
pranathivemuri Jan 5, 2021
d0bec5c
Update bam config
olgabot Jan 6, 2021
000d6ca
Add dump ch_sourmash_sketches_mixed
olgabot Jan 6, 2021
6bce013
Update schema
olgabot Jan 6, 2021
04e62d4
Merge remote-tracking branch 'origin' into olgabot/sourmash-sig-merge
olgabot Jan 6, 2021
ba765ff
Add params.ksizes to sketch output
olgabot Jan 7, 2021
ed5e72b
Add peptide_molecules
olgabot Jan 7, 2021
1718502
add check for skip_compute in sig merge logic
olgabot Jan 7, 2021
0029b51
Add header
olgabot Jan 7, 2021
c04a5a1
Only mix sketches if not skip_compute
olgabot Jan 7, 2021
5893884
param --> params
olgabot Jan 7, 2021
4118c6c
Add some projectdir stuff
olgabot Jan 9, 2021
baa96f8
More projectDir fixes
olgabot Jan 11, 2021
8ba5db1
Do per-ksize sourmash sig merge
olgabot Jan 11, 2021
c27b2b4
Add sourmash describe csvs to multiqc
olgabot Jan 11, 2021
05f6702
Update ProjectDir
olgabot Jan 11, 2021
2fe4523
Properly save translate output
olgabot Jan 11, 2021
27d95f0
Add dump of sourmash sketches
olgabot Jan 11, 2021
e68db00
Fixing sourmash sig merge
olgabot Jan 11, 2021
7195eae
Add ch_sourmash_sig_describe_nucleotides
olgabot Jan 11, 2021
9d090b4
more if/else
olgabot Jan 11, 2021
5bea4c9
Update changelog
olgabot Jan 16, 2021
9682b03
Getting "sig merge" to finally run
olgabot Jan 16, 2021
76b2ed7
Add option to skip sig merge
olgabot Jan 16, 2021
3c95f82
Update validate_sketch_value to only allow a single value
olgabot Jan 16, 2021
1321968
Change sketch values to single value
olgabot Jan 16, 2021
dc0ecdd
peptide_molecule --> translate_peptide_molecule
olgabot Jan 17, 2021
1fe32b1
add "translate_" to peptide ksize and jaccard threshold
olgabot Jan 18, 2021
1297886
Do sig merge on individual moltypes
olgabot Mar 9, 2021
d8e764b
Add test_sig_merge
olgabot Mar 9, 2021
232ac0d
Add test_sig_merge to CI
olgabot Mar 9, 2021
253fa83
Don't allow multiple sketch values
olgabot Mar 9, 2021
67aec5e
Reduce bloom filter table size
olgabot Mar 9, 2021
8c99f9f
Sig merge is working!
olgabot Mar 9, 2021
fccec83
Make test params more realistic
olgabot Mar 9, 2021
8de30e3
Update default ksizes, add track abundance true
olgabot Mar 9, 2021
0aade17
Update variables in merge_renamed_sigs.pyh
olgabot Mar 9, 2021
ad36259
Get sourmash compare to happen on correct ksizes and moltypes
olgabot Mar 9, 2021
43bbafd
Merge branch 'dev' into olgabot/sourmash-sig-merge
olgabot Mar 9, 2021
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -67,15 +67,18 @@ jobs:
- "test --sketch_scaled 2,4 --sketch_num_hashes_log2 false"
- "test_bam --barcodes_file false --rename_10x_barcodes false --save_fastas false --write_barcodes_meta_csv false"
- "test_bam --rename_10x_barcodes false --write_barcodes_meta_csv false"
- "test_bam --skip_sig_merge"
- "test_bam --write_barcodes_meta_csv false"
- "test_bam --barcodes_file false --rename_10x_barcodes false"
- "test_bam --rename_10x_barcodes false"
- "test_fastas"
- "test_protein_fastas"
- "test_remove_ribo"
- "test_sig_merge"
- "test_tenx_tgz"
- "test_translate"
- "test_translate_bam"
- "test_translate_bam --skip_sig_merge"
steps:
- name: Check out pipeline code
uses: actions/checkout@v2
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
"python.formatting.provider": "black"
}
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -20,8 +20,10 @@ Initial release of nf-core/kmermaid, created with the [nf-core](https://nf-co.re
barcode fastq
* Add version printing for sencha, bam2fasta, and sourmash in Dockerfile, update versions in environment.yml
* For processes translate, sourmash compute add cpus=1 as they are only serial ([#107](https://github.com/nf-core/kmermaid/pull/107))
* Add `sourmash sig merge` for aligned/unaligned signatures from bam files, and add `--skip_sig_merge` option to turn it off
* Add `--protein_fastas` option for creating sketches of already-translated protein sequences
* Add `--skip_compare` option to skip `sourmash_compare_sketches` process
* Add merging of aligned/unaligned parts of single-cell data ([#117](https://github.com/nf-core/kmermaid/pull/117))
* Add renamed package dependency orpheum (used to be known as sencha)

### `Fixed`
@@ -61,3 +63,5 @@ barcode fastq
### `Dependencies`

### `Deprecated`

* Removed ability to specify multiple `--scaled` or `--num-hashes` values to enable merging of signatures
112 changes: 112 additions & 0 deletions bin/merge_rename_sigs.py
@@ -0,0 +1,112 @@
#!/usr/bin/env python
import argparse
import itertools
import shutil

import sourmash
from sourmash.logging import error


def merge(filenames, ksize, moltype, name=None, outsig=None):
    """
    Merge one or more signatures.

    Adapted from the 'sourmash sig merge' command.
    """

    first_sig = None
    mh = None
    total_loaded = 0

    for sigfile in filenames:
        this_n = 0

        # Only load signatures matching the requested ksize and moltype
        for sigobj in sourmash.load_file_as_signatures(
            sigfile, ksize=ksize, select_moltype=moltype
        ):

            if sigobj is None:
                error(
                    "No signature in file {}",
                    sigfile,
                )

            # first signature? initialize a bunch of stuff
            if first_sig is None:
                first_sig = sigobj
                # Empty MinHash with the same parameters as the first signature
                mh = first_sig.minhash.copy_and_clear()

            try:
                sigobj_mh = sigobj.minhash

                mh.merge(sigobj_mh)
            except Exception:
                error(
                    "ERROR when merging signature '{}' ({}) from file {}",
                    sigobj.name(),
                    sigobj.md5sum()[:8],
                    sigfile,
                )
                raise

            this_n += 1
            total_loaded += 1

    merged_sigobj = sourmash.SourmashSignature(mh)
    if name is not None:
        merged_sigobj._name = name

    if outsig is not None:
        with open(outsig, "wt") as f:
            sourmash.save_signatures([merged_sigobj], fp=f)

    return merged_sigobj


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="""Merge signatures with same ksize and moltype"""
    )
    # Signature files
    parser.add_argument("sigfiles", nargs="+")
    parser.add_argument(
        "--moltypes",
        type=str,
        help="Molecule types, comma-separated. e.g. 'protein,dayhoff'",
        required=True,
    )
    parser.add_argument(
        "--ksizes",
        type=str,
        help="K-mer sizes to combine, comma-separated",
        required=True,
    )

    parser.add_argument(
        "-o", "--outsig", type=str, help="Signature file to output to", required=True
    )
    parser.add_argument(
        "-n",
        "--name",
        type=str,
        help="Name of the signature to use",
    )

    args = parser.parse_args()

    # Only iterate and read over the sigfiles if there is really something to merge
    # "something to merge" = there is more than one sigfile. otherwise there's no point
    # in reading in the files only to make the same file again
    if len(args.sigfiles) > 1:
        ksizes = [int(ksize) for ksize in args.ksizes.split(",")]
        moltypes = args.moltypes.split(",")

        # One merged signature per (moltype, ksize) combination
        merged_sigobjs = []
        for moltype, ksize in itertools.product(moltypes, ksizes):
            merged = merge(
                args.sigfiles, moltype=moltype, ksize=ksize, name=args.name
            )
            merged_sigobjs.append(merged)

        with open(args.outsig, "wt") as f:
            sourmash.save_signatures(merged_sigobjs, fp=f)
    else:
        # Otherwise, nothing to merge. Simply copy the file to the
        # output signature location
        shutil.copyfile(args.sigfiles[0], args.outsig)
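
For reviewers unfamiliar with why merging the aligned and unaligned signatures is safe, here is a toy illustration in plain Python. It is not sourmash's implementation: the `toy_sketch` and `toy_merge` helpers, the SHA-256 hashing, and the example reads are all assumptions made for this sketch, and abundance tracking is ignored. It only shows that, for sketches built with the same ksize and the same scaled value, the union of two sketches equals the sketch of the combined reads.

```python
import hashlib


def kmers(sequence, ksize):
    """Yield every overlapping k-mer of length ksize."""
    for i in range(len(sequence) - ksize + 1):
        yield sequence[i : i + ksize]


def toy_sketch(sequences, ksize, scaled):
    """Toy scaled sketch: keep hashes in the lowest 1/scaled of the hash space."""
    max_hash = 2**64 // scaled
    hashes = set()
    for sequence in sequences:
        for kmer in kmers(sequence, ksize):
            digest = hashlib.sha256(kmer.encode()).digest()
            value = int.from_bytes(digest[:8], "big")
            if value < max_hash:
                hashes.add(value)
    return hashes


def toy_merge(sketches):
    """Merge toy sketches by taking the union of their hash sets."""
    merged = set()
    for sketch in sketches:
        merged |= sketch
    return merged


# Hypothetical reads for one cell, split the way the pipeline splits them
aligned = ["ACGTACGTTGCA", "TTGCAACGGT"]
unaligned = ["GGGTTTAACC", "ACGTACGTTGCA"]

merged = toy_merge([toy_sketch(aligned, 5, 2), toy_sketch(unaligned, 5, 2)])
combined = toy_sketch(aligned + unaligned, 5, 2)
print(merged == combined)
```

The equality holds only because both toy sketches share one ksize and one scaled value, which is the same kind of constraint the script enforces by merging one (moltype, ksize) combination at a time.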
154 changes: 154 additions & 0 deletions bin/validate_sketch_value.py
@@ -0,0 +1,154 @@
#!/usr/bin/env python
from __future__ import print_function
from collections import OrderedDict, defaultdict, Counter
import logging
import argparse
import glob
import os
import sys


# Create a logger
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)


def get_sketch_value(value, value_log2):
    try:
        if value:
            if "," in value:
                logger.error(
                    "Can only provide a single number to --sketch_num_hashes or"
                    f" --sketch_scaled. Provided '{value}'"
                )
                sys.exit(1)
            sketch_value = int(value)
        else:
            if "," in value_log2:
                logger.error(
                    "Can only provide a single number to --sketch_num_hashes_log2 or"
                    f" --sketch_scaled_log2. Provided '{value_log2}'"
                )
                sys.exit(1)
            sketch_value = 2 ** int(value_log2)
    except ValueError:
        logger.exception(
            "Can only supply a single value to --sketch_num_hashes, "
            "--sketch_num_hashes_log2, --sketch_scaled, "
            "--sketch_scaled_log2 "
        )
        # Exit here; otherwise sketch_value would be unbound at the return
        sys.exit(1)
    return sketch_value


def value_or_bool(value):
    if value == "false":
        return False
    if value == "true":
        # A bare flag arrives as the string "true", so no number was given
        logger.error(
            "Must set a value for the --sketch_num_hashes, "
            "--sketch_num_hashes_log2, --sketch_scaled, "
            "--sketch_scaled_log2 options! Cannot simply set the "
            "flag. E.g. '--sketch_num_hashes 5' is valid but "
            "'--sketch_num_hashes' on its own is not"
        )
        sys.exit(1)
    else:
        return value


def main(
    sketch_num_hashes,
    sketch_num_hashes_log2,
    sketch_scaled,
    sketch_scaled_log2,
    out,
    sketch_style,
):
    sketch_num_hashes = value_or_bool(sketch_num_hashes)
    sketch_num_hashes_log2 = value_or_bool(sketch_num_hashes_log2)
    sketch_scaled = value_or_bool(sketch_scaled)
    sketch_scaled_log2 = value_or_bool(sketch_scaled_log2)

    using_size = sketch_num_hashes or sketch_num_hashes_log2
    using_scaled = sketch_scaled or sketch_scaled_log2

    if using_size and using_scaled:
        logger.error(
            "Cannot specify both sketch scales and sizes! Can only"
            " use one of --sketch_num_hashes, --sketch_num_hashes_log2, --sketch_scaled, "
            "--sketch_scaled_log2. Exiting."
        )
        sys.exit(1)

    if using_size:
        if sketch_num_hashes and sketch_num_hashes_log2:
            logger.error(
                "Cannot specify both --sketch_num_hashes and --sketch_num_hashes_log2! Exiting."
            )
            sys.exit(1)
        sketch_value = get_sketch_value(sketch_num_hashes, sketch_num_hashes_log2)
        with open(sketch_style, "w") as f:
            f.write("size")
    elif using_scaled:
        if sketch_scaled and sketch_scaled_log2:
            logger.error(
                "Cannot specify both --sketch_scaled and --sketch_scaled_log2! Exiting."
            )
            sys.exit(1)
        sketch_value = get_sketch_value(sketch_scaled, sketch_scaled_log2)
        with open(sketch_style, "w") as f:
            f.write("scaled")

    else:
        logger.info(
            "Did not specify a sketch size or scale with any of "
            "--sketch_num_hashes, --sketch_num_hashes_log2, --sketch_scaled, --sketch_scaled_log2! "
            "Falling back on sourmash's default of --sketch_scaled 500"
        )
        sketch_value = 500

    with open(out, "w") as f:
        f.write(f"{sketch_value}\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="""Ensure that sketch sizes/scaleds provided are valid"""
    )
    parser.add_argument(
        "--sketch_num_hashes", type=str, help="Flat size of the sketches"
    )
    parser.add_argument(
        "--sketch_num_hashes_log2", type=str, help="Flat size of the sketches, log2"
    )
    parser.add_argument(
        "--sketch_scaled", type=str, help="Fraction of total observed hashes to observe"
    )
    parser.add_argument(
        "--sketch_scaled_log2",
        type=str,
        help="Fraction of total observed hashes to observe, log2",
    )
    parser.add_argument(
        "-o",
        "--output",
        dest="output",
        default="sketch_value.txt",
        type=str,
        help="file with output",
    )
    parser.add_argument(
        "--sketch_style",
        default="sketch_style.txt",
        type=str,
        help="file indicating 'size' or 'scaled'",
    )

    args = parser.parse_args()
    main(
        args.sketch_num_hashes,
        args.sketch_num_hashes_log2,
        args.sketch_scaled,
        args.sketch_scaled_log2,
        args.output,
        args.sketch_style,
    )
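
The validation rules in this script can be summarized as a small pure function, which may help when reviewing the branches above. This is a sketch under assumptions, not part of the PR: `resolve_sketch` is a hypothetical helper, it raises `ValueError` where the script logs and exits, and it reports the fallback as `"scaled"` based only on the script's log message about `--sketch_scaled 500`.

```python
def resolve_sketch(num_hashes=None, num_hashes_log2=None, scaled=None, scaled_log2=None):
    """Return (style, value) for string-valued options, mirroring the script's rules."""

    def parse(value, value_log2):
        # Exactly one of the pair is set; the log2 form is exponentiated
        raw = value if value else value_log2
        if "," in raw:
            raise ValueError(f"Only a single value is allowed, got '{raw}'")
        number = int(raw)
        return number if value else 2**number

    using_size = bool(num_hashes or num_hashes_log2)
    using_scaled = bool(scaled or scaled_log2)

    if using_size and using_scaled:
        raise ValueError("Cannot specify both sketch sizes and scales")
    if num_hashes and num_hashes_log2:
        raise ValueError("Cannot specify both num_hashes and num_hashes_log2")
    if scaled and scaled_log2:
        raise ValueError("Cannot specify both scaled and scaled_log2")

    if using_size:
        return "size", parse(num_hashes, num_hashes_log2)
    if using_scaled:
        return "scaled", parse(scaled, scaled_log2)
    # Assumed fallback, taken from the script's log message
    return "scaled", 500


print(resolve_sketch(scaled="2"))
print(resolve_sketch(num_hashes_log2="4"))
print(resolve_sketch())
```

Passing a comma-separated value such as `scaled="2,4"` raises `ValueError` in this sketch, which corresponds to the changelog entry about no longer accepting multiple sketch values.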