Kat comp finds specific kmers between 2 fastq files with the same reads not given in the same order (with a reproducible example) #188

Open
jfouret opened this issue May 16, 2024 · 0 comments


Hi,

Thank you for this tool. I wanted to use kat comp to validate another tool for lossless compression of FASTQ reads. That tool reorders the reads but should be lossless. I was surprised to see specific k-mers after decompression. To check that this is not an artefact, I wanted to confirm that giving kat comp two identical sets of reads, in a different order, yields 0 specific k-mers.

However, that is not what I observed. Maybe I made a mistake somewhere, or there may be an artifact in kat comp.

Below is the code to reproduce my results:

SRR=SRR14237206
apptainer run docker://ncbi/sra-tools prefetch $SRR
apptainer run docker://ncbi/sra-tools fasterq-dump $SRR \
  --split-files --progress
pigz -p 8 ${SRR}_* 
mkdir fastq ; mv ${SRR}_*.fastq.gz fastq/
mkdir shuffle
apptainer run docker://staphb/seqkit seqkit shuffle fastq/${SRR}_1.fastq.gz --out-file shuffle/${SRR}_1.fastq.gz
apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
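One way to rule out the shuffle step itself as the source of the difference would be to check that both files hold the same sequences regardless of order. The sketch below is not part of the original run. It assumes 4-line FASTQ records (no wrapped sequences), and the function names `fastq_seq_checksum` and `same_reads` are hypothetical helpers introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: confirm two gzipped FASTQ files hold the same reads regardless of order.
# Assumption: every record spans exactly 4 lines (sequence on line 2).
# Function names are hypothetical, not part of KAT or seqkit.

# Print a checksum of the sorted sequence lines of one gzipped FASTQ file.
fastq_seq_checksum() {
  gzip -dc "$1" | awk 'NR % 4 == 2' | LC_ALL=C sort | cksum
}

# Return 0 when both files contain the same multiset of sequences.
same_reads() {
  [ "$(fastq_seq_checksum "$1")" = "$(fastq_seq_checksum "$2")" ]
}

# Example call (paths taken from the commands above):
#   same_reads fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz && echo "same reads"
```

If this check passes, the two inputs differ only in read order, so any specific k-mers reported afterwards would not come from the shuffle step.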

I got these results:

$ apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
INFO:    Using cached SIF image
Kmer Analysis Toolkit (KAT) V2.4.1

Running KAT in COMP mode
------------------------

Input 1 is a sequence file.  Counting kmers for input 1 (fastq/SRR14237206_1.fastq.gz) ... done.  Time taken: 32.2s

Input 2 is a sequence file.  Counting kmers for input 2 (shuffle/SRR14237206_1.fastq.gz) ... done.  Time taken: 34.2s

Comparing hashes ... done.  Time taken: 27.0s

Merging results ... done.  Time taken: 0.7s

Saving results to disk ... done.  Time taken: 0.3s


Summary statistics
------------------

K-mer statistics for: 
 - Hash 1: "fastq/SRR14237206_1.fastq.gz"
 - Hash 2: "shuffle/SRR14237206_1.fastq.gz"

Total K-mers in: 
 - Hash 1: 1464945516
 - Hash 2: 1464945516

Distinct K-mers in:
 - Hash 1: 364458959
 - Hash 2: 364458959

Total K-mers only found in:
 - Hash 1: 0
 - Hash 2: 131916277

Distinct K-mers only found in:
 - Hash 1: 0
 - Hash 2: 129068475

Shared K-mers:
 - Total shared found in hash 1: 1464945516
 - Total shared found in hash 2: 1464945516
 - Distinct shared K-mers: 364458959

Distance between spectra 1 and 2 (all k-mers):
 - Manhattan distance: 0
 - Euclidean distance: 0
 - Cosine distance: 1.11022e-16
 - Canberra distance: 0
 - Jaccard distance: 0

Distance between spectra 1 and 2 (shared k-mers):
 - Manhattan distance: 0
 - Euclidean distance: 0
 - Cosine distance: 1.11022e-16
 - Canberra distance: 0
 - Jaccard distance: 0

Creating plot(s) ... done.  Time taken: 1.3s

Analysing peaks for spectra copy number matrix
----------------------------------------------

Analysing distributions for: kat-comp-main.mx ... 
Analysing full spectra
No peaks detected for full spectra.  Can't continue.
done.  Time taken:  0.0s

Main spectra statistics
-----------------------
K-value used: 27
Peaks in analysis: 0
Global minima @ Frequency=2x (1420224)
Global maxima @ Frequency=9x (10974317)
Overall mean k-mer frequency: 0x

No peaks detected

Calculating genome statistics
-----------------------------
No peaks detected, so no genome stats to report
Estimated assembly completeness: Unknown

Creating plots
--------------

No peaks in K-mer frequency histogram.  Not plotting.


KAT COMP completed.
Total runtime: 96.7s

What I do not understand is this:

Total K-mers only found in:
 - Hash 1: 0
 - Hash 2: 131916277 <=============================================

Distinct K-mers only found in:
 - Hash 1: 0
 - Hash 2: 129068475  <=============================================
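The figures for hash 2 also look inconsistent with each other. The sketch below uses only numbers copied from the summary above and assumes, without confirmation from the KAT documentation, that every counted k-mer is either shared or specific to its hash, so that shared plus specific should equal the total.

```shell
#!/usr/bin/env bash
# Values copied from the KAT summary for hash 2.
total=1464945516    # Total K-mers in hash 2
shared=1464945516   # Total shared found in hash 2
only=131916277      # Total K-mers only found in hash 2

# Assumption (not confirmed by KAT's docs): total = shared + only.
# A non-zero excess means the reported counts do not add up.
excess=$(( shared + only - total ))
echo "k-mers unaccounted for in hash 2: $excess"
```

Under that assumption the shared count already accounts for every k-mer in hash 2, so the specific k-mers reported for hash 2 are counted on top of the total.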

Thank you,
