Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with filter_by_abundance() #157

Open
charlesfoster opened this issue Aug 23, 2022 · 5 comments
Open

Error with filter_by_abundance() #157

charlesfoster opened this issue Aug 23, 2022 · 5 comments

Comments

@charlesfoster
Copy link

Hi,

I'm running MitoZ through singularity after converting the docker container into an image (.sif). The pipeline runs well on some samples, but on 'weaker' samples with low abundance of mitochondrial DNA, the pipeline crashes with an error:

2022-08-23 13:54:19,126 - mitoz.utility.utility - WARNING - 
filter_by_abundance():
All sequences are low abundance (<10X)
Traceback (most recent call last):
  File "/app/anaconda/bin/mitoz", line 10, in <module>
    sys.exit(main())
  File "/app/anaconda/lib/python3.9/site-packages/mitoz/MitoZ.py", line 88, in main
    args.func(args)
  File "/app/anaconda/lib/python3.9/site-packages/mitoz/all/all.py", line 289, in main
    args.fastafiles, assemble_all_result_wdir = mitoz.assemble.main(assemble_args)
  File "/app/anaconda/lib/python3.9/site-packages/mitoz/assemble/assembly.py", line 480, in main
    assembly_file, mt_file = use_megahit(
  File "/app/anaconda/lib/python3.9/site-packages/mitoz/assemble/assembly.py", line 388, in use_megahit
    mt_file = findmitoscaf.main(args)
  File "/app/anaconda/lib/python3.9/site-packages/mitoz/findmitoscaf/findmitoscaf.py", line 444, in main
    selected_mt_fa = select_sequences_by_gene_completeness(
  File "/app/anaconda/lib/python3.9/site-packages/mitoz/findmitoscaf/findmitoscaf.py", line 248, in select_sequences_by_gene_completeness
    hmm_outf_tbl_besthit_sim_fasta_like + ".sorted"
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Command used:

mitoz all  \
--skip_filter \
 --outprefix 78_484_S11  \
--thread_number 8      \
--clade Arthropoda   \
--genetic_code auto   \
--species_name "Unknown" \
--fq1 78_484_S11_mito_R1.fq.gz  \
--fq2 78_484_S11_mito_R2.fq.gz   \
--fastq_read_length 150   \
--assembler megahit  \
--kmers_megahit 39 59 79 99 119 141         \
--memory 50        \
 --requiring_taxa 'Arthropoda' >> 78_484_S11.mitoz.log 2>&1

Thanks

@linzhi2013
Copy link
Owner

linzhi2013 commented Aug 23, 2022

Hi,

there is a message All sequences are low abundance (<10X), which means that all these sequences won't be passed to the subsequent mitochondrial-sequence selection steps. That is why you got TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' error in the end.

Maybe you can try smaller kmers or different assemblers, or provide more input data (if you have one).

Or you can set a lower value of --min_abundance (the default is 10, which means 10X as in the log message), although I do not recommend it, because they could be false positives. A more quicker way is to blast the file mt_assembly/megahit/78_484_S11.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta against NCBI NT database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and check if there are potential mitochondrial sequences. If there are, you must further verify them by yourself.

Cheers

@charlesfoster
Copy link
Author

Thanks for the reply - that makes sense.

I wonder if, as a feature request, it might be possible to change the error handling in these circumstances? A TypeError from Python isn't entirely intuitive for figuring out what went wrong in the analysis. Some suggestions might be to not print the traceback, but instead print a message like "Mitochondrial sequence selection cannot be run: all sequences are low abundance (<10X). Quitting."

Also, just an additional idea: in these situations of failure, could the expected output files be generated regardless, but just be empty? The reason I ask is that I've put together a pretty basic snakemake workflow that incorporates MitoZ, and when the analysis quits with a non-zero exit code in these situations, it causes the workflow to quit. The necessary workarounds in snakemake to handle bash exceptions are currently a bit ugly:

rule mitoz_assembly:
    input:
        r1 = os.path.join(RESULT_DIR, "{sample}/{sample}_mito_R1.fq.gz"),
        r2 = os.path.join(RESULT_DIR, "{sample}/{sample}_mito_R2.fq.gz")
    output:
        ckp = os.path.join(RESULT_DIR, "{sample}/assembly/{sample}.mitozComplete.txt"),
        complete_file = os.path.join(RESULT_DIR, "{sample}/mitoz/{sample}.result/{sample}.{sample}.megahit.mitogenome.fa.result/summary.txt"),
        mitoz_assembly = os.path.join(RESULT_DIR, "{sample}/mitoz/{sample}.result/{sample}.{sample}.megahit.result/{sample}.megahit.mitogenome.fa"),
    params: 
        sample = "{sample}",
        outdir = os.path.join(RESULT_DIR, "{sample}/assembly"),
        workdir = os.path.join(RESULT_DIR, "{sample}","mitoz"),
        touchdir1 = os.path.join(RESULT_DIR, "{sample}/mitoz/{sample}.result/{sample}.{sample}.megahit.result/"),
        touchdir2 = os.path.join(RESULT_DIR, "{sample}/mitoz/{sample}.result/{sample}.{sample}.megahit.mitogenome.fa.result/"),
    message:"assembling {wildcards.sample}"
    threads: 8
    log:
        os.path.join(RESULT_DIR,"{sample}/{sample}.mitoz.log")
    resources:
        mem_mb=30000,
        cpus=8
    container:
        "docker://guanliangmeng/mitoz:3.4"
    shell:
        """
        mitoz all \
        --workdir {params.workdir} \
        --skip_filter \
        --outprefix {wildcards.sample} \
        --thread_number 8 \
        --clade Arthropoda \
        --genetic_code auto \
        --species_name "Unknown" \
        --fq1 {input.r1} \
        --fq2 {input.r2} \
        --fastq_read_length 150 \
        --data_size_for_mt_assembly 0 \
        --assembler megahit \
        --kmers_megahit 39 59 79 99 119 141 \
        --memory 50 \
        --requiring_taxa 'Arthropoda' >> {log} 2>&1 || mkdir -p {params.touchdir1} && mkdir -p mkdir -p {params.touchdir2} && touch {output.complete_file} && touch {output.mitoz_assembly}

        touch {output.ckp}
        """

Finally, would it be possible to 'clean' the output directory names a little by default? Currently when specifying a working directory (mitoz) and an outprefix (78_484_S11), the results directories have the outprefix in their names twice and are quite complex, for example the summary.txt file is found in: ./mitoz/74_498_S10.result/74_498_S10.74_498_S10.megahit.mitogenome.fa.result/.

All good if these can't be implemented now/at all - just some ideas! Thanks for the great tool regardless.

Charles

@linzhi2013
Copy link
Owner

Hi Charles,

Thanks for your suggestion!

I wonder if, as a feature request, it might be possible to change the error handling in these circumstances? A TypeError from Python isn't entirely intuitive for figuring out what went wrong in the analysis. Some suggestions might be to not print the traceback, but instead print a message like "Mitochondrial sequence selection cannot be run: all sequences are low abundance (<10X). Quitting."

Indeed, we need a better error-handing method. Your suggestion is great. This can be implemented.

Also, just an additional idea: in these situations of failure, could the expected output files be generated regardless, but just be empty?

Maybe we can generate an empty resulting directory in this case? To me, it would seem a bit complicated to generate (empty) individual resulting files, because the findmitoscaf module is used in different situations.

./mitoz/74_498_S10.result/74_498_S10.74_498_S10.megahit.mitogenome.fa.result/

Yes, I also don't like this naming format, will try to fix it later.

These can be fixed in the next release.

Thanks a lot for your feedback!

Cheers
Guanliang

@YahuiMen
Copy link

YahuiMen commented Mar 12, 2024

Hi,

there is a message All sequences are low abundance (<10X), which means that all these sequences won't be passed to the subsequent mitochondrial-sequence selection steps. That is why you got TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' error in the end.

Maybe you can try smaller kmers or different assemblers, or provide more input data (if you have one).

Or you can set a lower value of --min_abundance (the default is 10, which means 10X as in the log message), although I do not recommend it, because they could be false positives. A more quicker way is to blast the file mt_assembly/megahit/78_484_S11.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta against NCBI NT database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and check if there are potential mitochondrial sequences. If there are, you must further verify them by yourself.

Cheers

I used version 3.6 installed with docker and encountered the same problem. I also tried different solutions, but the problem was not solved.
The strange thing is that in my results this file is empty.
Below is my code and the problem I encountered.
image
Snipaste_2024-03-12_15-44-42
Snipaste_2024-03-12_15-45-36

Thank you for your help.

@linzhi2013
Copy link
Owner

linzhi2013 commented Mar 12, 2024

image this file is only 3.6 Kb in size. It seems there are only a few mt sequences assembled. you may try to use smaller kmers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants