kmerfinder update and optimization #170
Comments
Hi @SchwarzMarek,
The pipeline should work with the unpacked form as well. It only performs the untar operation when the kmerfinder database is provided in
Yes, there seems to be an issue with the Zenodo URL that worked with nf-core/bacass version v2.3.1. Have you tried using
Super interesting! Let me look into it next week (I'm currently trying to release version 2.4.0, as the
Sure, I’m open to discussing this! I think it would definitely enhance the pipeline as well. We should open a specific issue for this and work on it there. What do you think?
Hi, ad 1) Thanks for the clarification. When I tried passing a directory it didn't work (I used a relative path); it works when I pass an absolute path. On the previous issue with canu (#164), I can confirm that it now works in Best
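For anyone hitting the same relative-path problem, here is a minimal sketch of the behaviour discussed above: resolve the database path to an absolute one, use it directly if it is already an unpacked directory, and untar it only if it is an archive. This is not the pipeline's actual code; the function name `prepare_kmerfinder_db` and the `kmerfinderdb` target directory are made up for illustration.

```python
import tarfile
from pathlib import Path


def prepare_kmerfinder_db(db: str, workdir: str = ".") -> Path:
    """Return an absolute path to an unpacked KmerFinder database.

    Hypothetical helper, not part of nf-core/bacass.
    """
    # Resolve relative paths (and "~") so downstream steps get an absolute path.
    path = Path(db).expanduser().resolve()

    # Already unpacked: use the directory as-is, no extraction needed.
    if path.is_dir():
        return path

    # Archive (e.g. tar.gz): extract once into an assumed target directory.
    if path.is_file() and tarfile.is_tarfile(str(path)):
        target = Path(workdir).resolve() / "kmerfinderdb"
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(str(path)) as archive:
            # Only extract archives you trust (e.g. the official database).
            archive.extractall(str(target))
        return target

    raise FileNotFoundError(f"Not a directory or tar archive: {path}")
```

Calling it before launching the pipeline, for example `prepare_kmerfinder_db("databases/kmerfinder")`, would hand over an absolute directory path regardless of where the command is run from.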
Hi, When I'm processing multiple genomes at once, The issue is that I see 4 different genomes in I would expect to see references for The zip file contains Best |
How do I pass the database if I have installed it via when I pass the path to my directory I installed it in I get an error. Any help is appreciated! |
Hi, bacass/modules/local/kmerfinder.nf Line 27 in c81202b
Maybe, since this is a bacterial assembly pipeline, it was not expected to work with viral databases. Best MS
Hello @SchwarzMarek,
The behavior you are experiencing might be related to how reference genomes are selected during the process. Summary of the methodology:
1. Kmerfinder process
bacass/subworkflows/local/kmerfinder_subworkflow.nf Lines 50 to 62 in 3f6a42d
2. FIND_DOWNLOAD_REFERENCE process.
You may see four reference genomes listed in the KmerFinder summary, but only three reference genomes used by Best |
Hi @Daniel-VM, I've opened a new issue for the reference genome download, as suggested previously (#172), and I'm closing this one. Thanks for the answers :).
Description of feature
I very much appreciate the functionality implemented with kmerfinder (that is, the automatic search for a close genome and running Quast with it). However, I'm running into several issues with the current implementation in `bacass`:

1. The kmerdb must be provided as `tar.gz`. This leads to excessive storage usage and the need to unpack the archive on each run of the pipeline (without `-resume`). I suggest allowing the db directory to be passed in unpacked form.

2. The only kmerdb which I've found to work is exactly the one stated in the `bacass` documentation (dated 2019/01/08): https://zenodo.org/records/10458361/files/20190108_kmerfinder_stable_dirs.tar.gz
   However, according to Zenodo, this archive is malformed and an updated version of the db has been deposited there, which, however, appears not to work with the pipeline. Moreover, this database is quite old; newer versions of `kmerfinder` dbs are deposited at ftp://ftp.cbs.dtu.dk/public/CGE/databases/KmerFinder/version/ (the latest there appears to be from 10/2021, also oldish). An even more recent one, from 2022, is accessible at https://cge.food.dtu.dk/services/KmerFinder/ (I haven't tested it yet; 63 GB download).

3. The need to provide `--ncbi_assembly_metadata` (which is updated by NCBI) leads to inconsistencies between the metadata and the kmerfinder db when an assembly is made obsolete (check the Venn diagram of the database and the current NCBI RefSeq assembly metadata). I can see that it would be problematic to have a 100% 1:1 correspondence, as the updates to NCBI are frequent, but currently the pipeline fails when the best-match assembly is not present in the metadata (I've encountered this with my data, and that's why I started digging around). Besides updating the database, I have a few ideas on how to obtain the assembly without needing to refer to the metadata file:
   a) In the Zenodo db, `bacteria.name` contains the complete assembly id (col 3), which can be used to construct the download link directly. (This will fix some cases, as `suppressed` records are still available, albeit not present in the metadata table.)
   b) In newer kmerfinder dbs there is `bacteria.tax`, which contains the `assembly id` (it can also be extracted from `bacteria.name` col 3), which can be used in

I'm also wondering if similar functionality could be implemented with kraken2 (and its database), so one could have one (possibly larger) database and use it for both the contamination screen and the identification of the most similar genome...

I do not have experience in writing `nextflow` pipelines, but I'm willing to write some Python scripts, e.g. for interacting with the NCBI API.
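To illustrate idea a), here is a rough sketch of how a download link could be built directly from an assembly id, without consulting the metadata table. It assumes that column 3 of `bacteria.name` holds the full identifier in the form `<accession>.<version>_<assembly name>` (for example `GCF_000005845.2_ASM584v2`); I have not verified this for every database version. The function name `assembly_fasta_url` is only illustrative, and the URL layout follows the public NCBI genomes FTP directory structure.

```python
import re

# Root of the NCBI genomes tree (served over HTTPS as well as FTP).
NCBI_GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"

# Assumed format: GCF_/GCA_ + 9 digits + .version + _assembly-name
_ASSEMBLY_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)_(\S+)$")


def assembly_fasta_url(assembly_id: str) -> str:
    """Build the genomic FASTA URL for a full assembly id (hypothetical helper)."""
    match = _ASSEMBLY_RE.match(assembly_id.strip())
    if match is None:
        raise ValueError(f"Unexpected assembly id format: {assembly_id!r}")

    prefix, digits = match.group(1), match.group(2)
    # The nine accession digits are split into three directory levels.
    triplets = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    full_id = match.group(0)
    return (
        f"{NCBI_GENOMES_ROOT}/{prefix}/{triplets}/"
        f"{full_id}/{full_id}_genomic.fna.gz"
    )


print(assembly_fasta_url("GCF_000005845.2_ASM584v2"))
```

If column 3 turns out to contain only the bare accession (without the assembly name), the directory can still be derived up to the three-digit levels, but the final folder name would have to be looked up, e.g. by listing that directory.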