Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better default filtering options for ska distance #82

Merged
merged 6 commits into from
Sep 25, 2024

Conversation

johnlees
Copy link
Member

  • Remove ambig sites by default
  • Link to filtering tutorial in docs
  • Slightly expand description of ska distance in the docs

Closes #81

Remove ambig sites by default
Link to filtering tutorial
Slightly expand description of ska distance in the docs
@johnlees
Copy link
Member Author

@rderelle would you mind checking whether the code on this branch gives the right result on your example from #81? Should just be runnable with ska distance -v <skf_file> (i.e. no flags added)

Copy link

codecov bot commented Sep 25, 2024

Codecov Report

Attention: Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.22%. Comparing base (b9c8909) to head (cc496f8).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
src/cli.rs 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #82      +/-   ##
==========================================
+ Coverage   93.18%   93.22%   +0.04%     
==========================================
  Files          16       16              
  Lines        2260     2259       -1     
==========================================
  Hits         2106     2106              
+ Misses        154      153       -1     
Flag Coverage Δ
93.22% <91.66%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rderelle
Copy link
Collaborator

@johnlees

I tried different parameter combinations. ska gives the expected answer with -m 0.9. Shouldn't the minimum frequency be 0.9 by default instead of 0 (as in ska align)?

ska distance -v out_4.8.2.skf > default.txt
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-25T15:09:18.535Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-25T15:09:19.879Z INFO [ska::merge_ska_array] Filtering removed 3395629 split k-mers
2024-09-25T15:09:19.879Z INFO [ska::generic_modes] Applying filters: threshold=0 constant_site_filter=No constant sites or ambiguous bases filter_ambig_as_missing=true ambig_mask=true no_gap_only_sites=false
2024-09-25T15:09:19.879Z INFO [ska::merge_ska_array] Updating variant counts
2024-09-25T15:09:20.037Z INFO [ska::merge_ska_array] Removed 4384 empty variants
2024-09-25T15:09:20.591Z INFO [ska::merge_ska_array] Filtering removed 73 split k-mers
2024-09-25T15:09:20.731Z INFO [ska::merge_ska_array] Masked 2003 ambiguous bases (non-A/C/G/T/U/N/-) with 'N'
2024-09-25T15:09:20.731Z INFO [ska::generic_modes] Calculating distances

-> 7-60 SNPs (with one distance = 1200 SNPs)

ska distance -v out_4.8.2.skf -m 0.9 > m0.9.txt
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-25T15:10:07.056Z INFO [ska::generic_modes] Applying filters: threshold=50 constant_site_filter=No constant sites filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-25T15:10:08.057Z INFO [ska::merge_ska_array] Filtering removed 3395629 split k-mers
2024-09-25T15:10:08.057Z INFO [ska::generic_modes] Applying filters: threshold=50 constant_site_filter=No constant sites or ambiguous bases filter_ambig_as_missing=true ambig_mask=true no_gap_only_sites=false
2024-09-25T15:10:08.057Z INFO [ska::merge_ska_array] Updating variant counts
2024-09-25T15:10:08.115Z INFO [ska::merge_ska_array] Removed 0 empty variants
2024-09-25T15:10:08.295Z INFO [ska::merge_ska_array] Filtering removed 48 split k-mers
2024-09-25T15:10:08.332Z INFO [ska::merge_ska_array] Masked 28 ambiguous bases (non-A/C/G/T/U/N/-) with 'N'
2024-09-25T15:10:08.332Z INFO [ska::generic_modes] Calculating distances

-> 0-24 SNPs ... that's the good one :)

ska distance -v out_4.8.2.skf -m 0.9 --allow-ambiguous > m0.9_allamb.txt
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-25T15:12:04.541Z INFO [ska::generic_modes] Applying filters: threshold=50 constant_site_filter=No constant sites filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-25T15:12:05.546Z INFO [ska::merge_ska_array] Filtering removed 3395629 split k-mers
2024-09-25T15:12:05.546Z INFO [ska::generic_modes] Applying filters: threshold=50 constant_site_filter=No filtering filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-25T15:12:05.566Z INFO [ska::merge_ska_array] Filtering removed 0 split k-mers
2024-09-25T15:12:05.566Z INFO [ska::generic_modes] Calculating distances

-> 15-39 SNPs

@johnlees
Copy link
Member Author

Ok, that's great input, thanks! I will update that default for min freq then merge this

@rderelle
Copy link
Collaborator

rderelle commented Sep 25, 2024

While it gives the expected distances, I'm surprised that the command ska distance -v out_4.8.2.skf -m 0.9 filters out less split k-mers than the command ska distance -v out_4.8.2.skf. Maybe a very minor error in the verbose mode (?).

@johnlees
Copy link
Member Author

While it gives the expected distances, I'm surprised that the command ska distance -v out_4.8.2.skf -m 0.9 filters out less split k-mers than the command ska distance -v out_4.8.2.skf. Maybe a very minor error in the verbose mode (?).

Ah yeah, I think I see the error. Want to try again with the most recent commit?

@rderelle
Copy link
Collaborator

rderelle commented Sep 25, 2024

It now makes more sense. :)

ska distance -v out_4.8.2.skf -m 0.9 > m0.9.txt
SKA: Split K-mer Analysis (the alignment-free aligner)
2024-09-25T15:44:39.977Z INFO [ska::generic_modes] Applying filters: threshold=50 constant_site_filter=No constant sites filter_ambig_as_missing=false ambig_mask=false no_gap_only_sites=false
2024-09-25T15:44:40.958Z INFO [ska::merge_ska_array] Filtering removed 4928769 split k-mers
2024-09-25T15:44:40.958Z INFO [ska::generic_modes] Applying filters: threshold=50 constant_site_filter=No constant sites or ambiguous bases filter_ambig_as_missing=true ambig_mask=true no_gap_only_sites=false
2024-09-25T15:44:40.958Z INFO [ska::merge_ska_array] Updating variant counts
2024-09-25T15:44:41.016Z INFO [ska::merge_ska_array] Removed 0 empty variants
2024-09-25T15:44:41.200Z INFO [ska::merge_ska_array] Filtering removed 91 split k-mers
2024-09-25T15:44:41.234Z INFO [ska::merge_ska_array] Masked 28 ambiguous bases (non-A/C/G/T/U/N/-) with 'N'
2024-09-25T15:44:41.234Z INFO [ska::generic_modes] Calculating distances

@johnlees
Copy link
Member Author

Ok great, this is looking good now

@johnlees johnlees merged commit 55667e8 into master Sep 25, 2024
8 of 9 checks passed
@johnlees johnlees deleted the i81_distances_ambig branch September 25, 2024 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

incorrect calling # of SNPs
2 participants