-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
incorrect calling # of SNPs #81
Comments
(base) pietersmit@mbpvanpieter2 VRE dataset % ska -V |
Some initial thoughts:
|
thanks for prompt reply! pietersmit@mbpvanpieter2 VRE dataset % ska distance -o distances5.txt KP_ska_index.skf --filter-ambiguous --min-freq 0.9 -v we got the output lowered (to almost a reasonable number) but is still way to high for what is expected. there are no other filter options with distance as far as --help stated. below the top of the output file. Sample1 Sample2 Distance Mismatches suggestions are welcome why this is not 0 snps as expected. thanks! |
Can you clarify why zero SNPs are expected here? I've only had a quick scan of the paper and supplementary table you mention, but it only looks like some are expected to be in the same sequence type (which is usually within ~0.5% genome divergence, which can be thousands or tens of thousands of SNPs). |
regarding the issue associated with distance estimates, I believe it corresponds to a previous reported issue: #69 These distances might be overestimated due to ambiguous nucleotides (a gut feeling; I've not checked). According to the alignment, the distance VRE01 VRE02 should be 4 or 5, not 128. |
nb: according to Table 4 of this paper, VRE01-VRE02-VRE03-VRE04 are part of a 5-SNPs cluster. |
another example and a possible solution Here is a skf file built from 55 Mtb isolates belonging to the same 12-SNPs cluster of transmission (available here: https://drive.google.com/drive/folders/1OWEQtq0UdnzZMPEIi-X0O_wxqEPn1mOo?usp=sharing). The SNP alignment is composed of about 90 SNPs, with lot of isolates separated by 0,1,2 or 3 SNPs. When distances are computed with the command 'ska distance out_4.8.2.skf -m 0.9 --filter-ambiguous', distances are here again a bit overestimated, with distances ranging from 5.25 to 31. Here are the first 12 lines of the output:
My assumption was that the '--filter-ambiguous' argument was incorrectly passed on to the function variant_dist in the merge_ska_array.rs. Therefore, I modified the following code from the file generic_modes.rs (starting at line 142; modifications correspond to 2 'true' and original code in comments):
After recompiling ska, I now obtain distances ranging from 0 to 19, with lot of distances between 0 and 3 as expected. A few distances are still decimals (eg, 4.75) but it isn't too surprising since the SNP alignment contains a few ambiguous nucleotides. Here are the first 12 lines of the output:
@johnlees, could this be a solution? |
Yeah thanks Romain, this looks like the right fix. I'll try and get this in and document this clearly next week |
nice! |
@PWSmit This fix essentially mimics the 2 arguments "--filter-ambig-as-missing --filter no-ambig-or-const" of the module ska align (see the post pointed out by John). Here, distances ignore: Therefore, this filtering only keeps split k-mers with ATGC variations, excluding positions with lot of ambiguous nucleotides, similar to the alignment you posted. However, there can also be ambiguous nucleotides in these split kmers ... see VRE26 in your alignment (possibly a mixture of strains). |
I appreciate the use in ska distance is quite misleading at present, as the filters are based on |
Hi,
Tried to validate SKA2 to current methods by comparing clustering abilities using this dataset:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8549354/. Supplementary tabel S2 contains detailed info.
tried various filter settings (little effect), K 31 and default (17 I believe) with major effect.
according to clustering reported in the paper, there should be plenty of clusters (even according to SKA1) but now the lowest has a few hundred SNPs. should I then use <1000 snps as a cut-off for calling identical? seems very odd. Advice is appreciated.
thanks,
Pieter
k-31 output:
Sample1 Sample2 Distance Mismatches
VRE01 VRE02 253.00 0.01965
VRE01 VRE03 285.67 0.02038
VRE01 VRE04 301.67 0.02102
VRE01 VRE05 2093.25 0.22315
VRE01 VRE06 2281.50 0.26828
VRE01 VRE07 2271.00 0.26410
VRE01 VRE08 2292.17 0.26630
VRE01 VRE09 2711.08 0.32633
VRE01 VRE10 2408.00 0.31685
VRE01 VRE11 2137.67 0.27616
VRE01 VRE12 247.50 0.01924
VRE01 VRE13 2301.67 0.28808
VRE01 VRE14 2272.50 0.26346
VRE01 VRE15 2242.00 0.26199
VRE01 VRE16 229.33 0.00975
k-17 output:
Sample1 Sample2 Distance Mismatches
VRE01 VRE02 938.00 0.01617
VRE01 VRE03 988.25 0.01657
VRE01 VRE04 1020.25 0.01717
VRE01 VRE05 4294.58 0.19870
VRE01 VRE06 4780.00 0.24028
VRE01 VRE07 4748.33 0.23723
VRE01 VRE08 4799.83 0.23887
VRE01 VRE09 5553.08 0.29795
VRE01 VRE10 5094.67 0.29039
VRE01 VRE11 4423.75 0.24879
VRE01 VRE12 938.33 0.01582
VRE01 VRE13 4976.08 0.26186
VRE01 VRE14 4758.83 0.23660
VRE01 VRE15 4664.83 0.23577
VRE01 VRE16 937.50 0.00718
code and v output below.
(base) pietersmit@mbpvanpieter2 VRE dataset % ska build -f samples.tsv -k 31 -o KP_ska_index --min-count 3 --threads 4
SKA: Split K-mer Analysis (the alignment-free aligner)
SKA done in 288s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
(base) pietersmit@mbpvanpieter2 VRE dataset % ska distance -o distances2.txt KP_ska_index.skf --filter-ambiguous
SKA: Split K-mer Analysis (the alignment-free aligner)
SKA done in 7s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
(base) pietersmit@mbpvanpieter2 VRE dataset % ska build -f samples.tsv -o KP_ska_index --min-count 3 --threads 4 -v
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-16T10:26:29.294Z INFO [ska] k=17: using 64-bit representation
2024-09-16T10:26:29.294Z INFO [ska::merge_ska_dict] Building skf dicts from sequence input
2024-09-16T10:26:29.294Z INFO [ska::merge_ska_dict] FASTQ files filtered with: min count: 3; minimum quality 20 (5); filter applied: Whole k-mer quality filtering
2024-09-16T10:26:29.295Z INFO [ska::merge_ska_dict] Build and merge skf dicts in parallel using 4 threads
2024-09-16T10:31:42.224Z INFO [ska::generic_modes] Converting to array representation and saving
SKA done in 315s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
2024-09-16T10:31:44.956Z INFO [ska] Complete
(base) pietersmit@mbpvanpieter2 VRE dataset % ska distance -o distances3.txt KP_ska_index.skf --filter-ambiguous -v
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-16T10:32:07.749Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-16T10:32:08.405Z INFO [ska::merge_ska_array] Filtering removed 2057265 split k-mers
2024-09-16T10:32:08.405Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites or ambiguous bases filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-16T10:32:08.889Z INFO [ska::merge_ska_array] Filtering removed 13568 split k-mers
2024-09-16T10:32:08.889Z INFO [ska::generic_modes] Calculating distances
SKA done in 6s
The text was updated successfully, but these errors were encountered: