incorrect calling # of SNPs #81

PWSmit · 2024-09-16T10:48:43Z

Hi,
Tried to validate SKA2 to current methods by comparing clustering abilities using this dataset:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8549354/. Supplementary tabel S2 contains detailed info.
tried various filter settings (little effect), K 31 and default (17 I believe) with major effect.

according to clustering reported in the paper, there should be plenty of clusters (even according to SKA1) but now the lowest has a few hundred SNPs. should I then use <1000 snps as a cut-off for calling identical? seems very odd. Advice is appreciated.
thanks,
Pieter

k-31 output:
Sample1 Sample2 Distance Mismatches
VRE01 VRE02 253.00 0.01965
VRE01 VRE03 285.67 0.02038
VRE01 VRE04 301.67 0.02102
VRE01 VRE05 2093.25 0.22315
VRE01 VRE06 2281.50 0.26828
VRE01 VRE07 2271.00 0.26410
VRE01 VRE08 2292.17 0.26630
VRE01 VRE09 2711.08 0.32633
VRE01 VRE10 2408.00 0.31685
VRE01 VRE11 2137.67 0.27616
VRE01 VRE12 247.50 0.01924
VRE01 VRE13 2301.67 0.28808
VRE01 VRE14 2272.50 0.26346
VRE01 VRE15 2242.00 0.26199
VRE01 VRE16 229.33 0.00975
k-17 output:
Sample1 Sample2 Distance Mismatches
VRE01 VRE02 938.00 0.01617
VRE01 VRE03 988.25 0.01657
VRE01 VRE04 1020.25 0.01717
VRE01 VRE05 4294.58 0.19870
VRE01 VRE06 4780.00 0.24028
VRE01 VRE07 4748.33 0.23723
VRE01 VRE08 4799.83 0.23887
VRE01 VRE09 5553.08 0.29795
VRE01 VRE10 5094.67 0.29039
VRE01 VRE11 4423.75 0.24879
VRE01 VRE12 938.33 0.01582
VRE01 VRE13 4976.08 0.26186
VRE01 VRE14 4758.83 0.23660
VRE01 VRE15 4664.83 0.23577
VRE01 VRE16 937.50 0.00718
code and v output below.

(base) pietersmit@mbpvanpieter2 VRE dataset % ska build -f samples.tsv -k 31 -o KP_ska_index --min-count 3 --threads 4
SKA: Split K-mer Analysis (the alignment-free aligner)
SKA done in 288s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
(base) pietersmit@mbpvanpieter2 VRE dataset % ska distance -o distances2.txt KP_ska_index.skf --filter-ambiguous
SKA: Split K-mer Analysis (the alignment-free aligner)
SKA done in 7s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
(base) pietersmit@mbpvanpieter2 VRE dataset % ska build -f samples.tsv -o KP_ska_index --min-count 3 --threads 4 -v
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-16T10:26:29.294Z INFO [ska] k=17: using 64-bit representation
2024-09-16T10:26:29.294Z INFO [ska::merge_ska_dict] Building skf dicts from sequence input
2024-09-16T10:26:29.294Z INFO [ska::merge_ska_dict] FASTQ files filtered with: min count: 3; minimum quality 20 (5); filter applied: Whole k-mer quality filtering
2024-09-16T10:26:29.295Z INFO [ska::merge_ska_dict] Build and merge skf dicts in parallel using 4 threads
2024-09-16T10:31:42.224Z INFO [ska::generic_modes] Converting to array representation and saving
SKA done in 315s
⬛⬜⬛⬜⬛⬜⬛
⬜⬛⬜⬛⬜⬛⬜
2024-09-16T10:31:44.956Z INFO [ska] Complete
(base) pietersmit@mbpvanpieter2 VRE dataset % ska distance -o distances3.txt KP_ska_index.skf --filter-ambiguous -v
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-16T10:32:07.749Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-16T10:32:08.405Z INFO [ska::merge_ska_array] Filtering removed 2057265 split k-mers
2024-09-16T10:32:08.405Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites or ambiguous bases filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-16T10:32:08.889Z INFO [ska::merge_ska_array] Filtering removed 13568 split k-mers
2024-09-16T10:32:08.889Z INFO [ska::generic_modes] Calculating distances
SKA done in 6s

PWSmit · 2024-09-16T10:49:53Z

(base) pietersmit@mbpvanpieter2 VRE dataset % ska -V
ska 0.3.10

johnlees · 2024-09-16T10:59:28Z

Some initial thoughts:

There are more filtering options you could include, see this post: https://www.bacpop.org/guides/snp_alignment_with_ska/. Have a look at the alignments produced by ska align in e.g. seaview too, to see if you have obvious problems with gaps or ambiguous sites.
Also see the --min-freq filter, 0.9 is a sensible default. I think ska distance supports this, but if not you can run ska weed on your skf file first.
If you also have assemblies available, it might be sensible to validate that your filtering of the fastq data was sufficient by doing a build with fasta input too.

PWSmit · 2024-09-16T14:09:58Z

thanks for prompt reply!
ran align as suggested (ska align --no-gap-only-sites --filter-ambig-as-missing --filter no-ambig-or-const KP_ska_index.skf >VRE.fa), doesnt seem to be too odd (see 2 printscreens below).

also attached the fasta file below. when running distance again with your suggestion, -v showed this:

pietersmit@mbpvanpieter2 VRE dataset % ska distance -o distances5.txt KP_ska_index.skf --filter-ambiguous --min-freq 0.9 -v
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-16T13:40:36.837Z INFO [ska::generic_modes] Applying filters: threshold=36 constant_site_filter=No constant sites filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-16T13:40:37.260Z INFO [ska::merge_ska_array] Filtering removed 1978942 split k-mers
2024-09-16T13:40:37.260Z INFO [ska::generic_modes] Applying filters: threshold=36 constant_site_filter=No constant sites or ambiguous bases filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-16T13:40:37.313Z INFO [ska::merge_ska_array] Filtering removed 1229 split k-mers
2024-09-16T13:40:37.313Z INFO [ska::generic_modes] Calculating distances

we got the output lowered (to almost a reasonable number) but is still way to high for what is expected. there are no other filter options with distance as far as --help stated. below the top of the output file.

Sample1 Sample2 Distance Mismatches
VRE01 VRE02 128.50 0.00055
VRE01 VRE03 129.50 0.00055
VRE01 VRE04 131.50 0.00055
VRE01 VRE05 1263.50 0.01216
VRE01 VRE06 1902.00 0.00058
VRE01 VRE07 1901.00 0.00054
VRE01 VRE08 1901.00 0.00050
VRE01 VRE09 2133.67 0.02631
VRE01 VRE10 2083.50 0.01599
VRE01 VRE11 1760.00 0.01124
VRE01 VRE12 127.00 0.00055
VRE01 VRE13 1912.50 0.00050
VRE01 VRE14 1912.00 0.00050
VRE01 VRE15 1890.00 0.00212
VRE01 VRE16 27.50 0.00057

suggestions are welcome why this is not 0 snps as expected. thanks!
VRE.txt

johnlees · 2024-09-16T14:25:25Z

suggestions are welcome why this is not 0 snps as expected. thanks!

Can you clarify why zero SNPs are expected here? I've only had a quick scan of the paper and supplementary table you mention, but it only looks like some are expected to be in the same sequence type (which is usually within ~0.5% genome divergence, which can be thousands or tens of thousands of SNPs).

rderelle · 2024-09-16T16:18:18Z

regarding the issue associated with distance estimates, I believe it corresponds to a previous reported issue: #69

These distances might be overestimated due to ambiguous nucleotides (a gut feeling; I've not checked). According to the alignment, the distance VRE01 VRE02 should be 4 or 5, not 128.

rderelle · 2024-09-16T16:46:38Z

nb: according to Table 4 of this paper, VRE01-VRE02-VRE03-VRE04 are part of a 5-SNPs cluster.

rderelle · 2024-09-18T18:26:44Z

another example and a possible solution

Here is a skf file built from 55 Mtb isolates belonging to the same 12-SNPs cluster of transmission (available here: https://drive.google.com/drive/folders/1OWEQtq0UdnzZMPEIi-X0O_wxqEPn1mOo?usp=sharing). The SNP alignment is composed of about 90 SNPs, with lot of isolates separated by 0,1,2 or 3 SNPs.

When distances are computed with the command 'ska distance out_4.8.2.skf -m 0.9 --filter-ambiguous', distances are here again a bit overestimated, with distances ranging from 5.25 to 31. Here are the first 12 lines of the output:

Sample1 Sample2 Distance Mismatches
ERR046926 ERR072044 6.75 0.01393
ERR046926 ERR072046 7.75 0.01682
ERR046926 ERR072047 6.75 0.01323
ERR046926 ERR072048 8.75 0.01371
ERR046926 ERR2512672 7.50 0.01736
ERR046926 ERR2513864 11.50 0.01256
ERR046926 ERR2513968 10.50 0.01258
ERR046926 ERR2514094 9.50 0.02025
ERR046926 ERR2514103 13.50 0.01257
ERR046926 ERR2514145 9.25 0.01262
ERR046926 ERR2514147 9.25 0.01275
ERR046926 ERR2514208 11.50 0.01266

My assumption was that the '--filter-ambiguous' argument was incorrectly passed on to the function variant_dist in the merge_ska_array.rs. Therefore, I modified the following code from the file generic_modes.rs (starting at line 142; modifications correspond to 2 'true' and original code in comments):

let mask_ambig = false;
let ignore_const_gaps = false;
let filter_ambig_as_missing = false;
let constant = apply_filters(
    ska_array,
    min_freq,
    filter_ambig_as_missing,
    &FilterType::NoConst,
    mask_ambig,
    ignore_const_gaps,
);
if filt_ambig || (min_freq * ska_array.nsamples() as f64 >= 1.0) {
    if filt_ambig {
        apply_filters(
            ska_array,
            min_freq,
            true,  //filter_ambig_as_missing,
            &FilterType::NoAmbigOrConst,
            true,  //mask_ambig,
            ignore_const_gaps,
        );
    } else {
        apply_filters(
            ska_array,
            min_freq,
            filter_ambig_as_missing,
            &FilterType::NoFilter,
            mask_ambig,
            ignore_const_gaps,
        );
    }
}

After recompiling ska, I now obtain distances ranging from 0 to 19, with lot of distances between 0 and 3 as expected. A few distances are still decimals (eg, 4.75) but it isn't too surprising since the SNP alignment contains a few ambiguous nucleotides. Here are the first 12 lines of the output:

Sample1 Sample2 Distance Mismatches
ERR046926 ERR072044 3.00 0.01393
ERR046926 ERR072046 4.00 0.01682
ERR046926 ERR072047 3.00 0.01322
ERR046926 ERR072048 5.00 0.01371
ERR046926 ERR2512672 3.00 0.01736
ERR046926 ERR2513864 7.00 0.01256
ERR046926 ERR2513968 6.00 0.01258
ERR046926 ERR2514094 5.00 0.02025
ERR046926 ERR2514103 9.00 0.01257
ERR046926 ERR2514145 4.75 0.01262
ERR046926 ERR2514147 4.75 0.01274
ERR046926 ERR2514208 7.00 0.01266

@johnlees, could this be a solution?

johnlees · 2024-09-20T10:23:52Z

Yeah thanks Romain, this looks like the right fix. I'll try and get this in and document this clearly next week

PWSmit · 2024-09-20T12:47:03Z

nice!
question; if you filter ambiguous nucleotides, how come you still have ambiguous nucleotides in your outcome, Romain?

rderelle · 2024-09-20T14:00:31Z

@PWSmit This fix essentially mimics the 2 arguments "--filter-ambig-as-missing --filter no-ambig-or-const" of the module ska align (see the post pointed out by John). Here, distances ignore:
_ split k-mers that contain a lot of ambiguous nucleotides (ie, spit k-mers repeated in the genome) by considering ambiguous nucleotides as missing data -> split k-mers ignored due to the --min-fre filter
_ split k-mers that contain middle base differing solely by the presence of an ambiguous character or absent in one sample (ie, no ATGC variation).

Therefore, this filtering only keeps split k-mers with ATGC variations, excluding positions with lot of ambiguous nucleotides, similar to the alignment you posted. However, there can also be ambiguous nucleotides in these split kmers ... see VRE26 in your alignment (possibly a mixture of strains).

johnlees · 2024-09-25T13:44:22Z

I appreciate the use in ska distance is quite misleading at present, as the filters are based on ska align. Really I think we just want to be able to filter on split k-mer frequence and whether to include ambiguous bases or not. I will address this in #82

johnlees mentioned this issue Sep 25, 2024

Better default filtering options for ska distance #82

Merged

johnlees closed this as completed in #82 Sep 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incorrect calling # of SNPs #81

incorrect calling # of SNPs #81

PWSmit commented Sep 16, 2024

PWSmit commented Sep 16, 2024

johnlees commented Sep 16, 2024

PWSmit commented Sep 16, 2024

johnlees commented Sep 16, 2024

rderelle commented Sep 16, 2024 •

edited

Loading

rderelle commented Sep 16, 2024 •

edited

Loading

rderelle commented Sep 18, 2024 •

edited

Loading

johnlees commented Sep 20, 2024

PWSmit commented Sep 20, 2024

rderelle commented Sep 20, 2024 •

edited

Loading

johnlees commented Sep 25, 2024

incorrect calling # of SNPs #81

incorrect calling # of SNPs #81

Comments

PWSmit commented Sep 16, 2024

PWSmit commented Sep 16, 2024

johnlees commented Sep 16, 2024

PWSmit commented Sep 16, 2024

johnlees commented Sep 16, 2024

rderelle commented Sep 16, 2024 • edited Loading

rderelle commented Sep 16, 2024 • edited Loading

rderelle commented Sep 18, 2024 • edited Loading

johnlees commented Sep 20, 2024

PWSmit commented Sep 20, 2024

rderelle commented Sep 20, 2024 • edited Loading

johnlees commented Sep 25, 2024

rderelle commented Sep 16, 2024 •

edited

Loading

rderelle commented Sep 16, 2024 •

edited

Loading

rderelle commented Sep 18, 2024 •

edited

Loading

rderelle commented Sep 20, 2024 •

edited

Loading