Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactored clustering code and new kmeans clustering criterion #19

Merged
merged 9 commits into from
Nov 23, 2020

Conversation

bricoletc
Copy link
Member

This PR has two main goals:

  • Continue the refactoring effort. This time on clustering code, placed in own source files, and added many more tests.
  • Address part of Ambiguous path production #15 , by reducing ambiguous paths (distinct paths in prg with same sequence). From empirical tests on TB and Pf datasets, it does reduce this problem significantly, and improves downstream genotyping (in gramtools)

One important point is I've changed the command name prg_from_msa to from_msa; this way we call make_prg from_msa. This may be annoying because will need to change scripts using make_prg if we go ahead with this. I find it more descriptive and easier to type, but has to be weighed against changing calls in existing scripts/pipelines.

Added

  • Unit tests for refactored kmeans clustering functions.
  • Add failing clustering tests: all sequences very close getting grouped into clusters, sequences with three clusters getting grouped into two clusters
  • New kmer clustering algorithm based on 'one-ref' principle, replacing inertia-halving criterion. Idea: if a group of sequences is similar to their majority sequence, do not need to be clustered. Similar defined via heuristic on %identity. This solves above failing cases and does not fail existing (eg integration) tests.

Modified

  • Refactoring of kmeans clustering: take clustering code out of main AlignedSeq object and into own source file.
  • Kmeans clustering split into more functions, for eg counting distinct kmers, producing kmer count matrix, extracting clusters, merging clusters.
  • Place prg from msa source files in own from_msa directory
  • Refactoring of AlignedSeq object: now called PrgBuilder; same for source file. Is clear it builds from msa as located in from_msa directory
  • Rename prg_from_msa subcommand to from_msa for less redundancy: now call as make_prg from_msa

Test coverage

Goes from

Name                                   Stmts   Miss  Cover   Missing
--------------------------------------------------------------------
make_prg/__init__.py                      11      2    82%   13-14
make_prg/interval_partition.py           124      2    98%   67, 132
make_prg/io_utils.py                      99     79    20%   15-18, 36-39, 42-62, 67-126, 133-140, 153-178
make_prg/make_prg_from_msa.py            196     30    85%   70-74, 202-203, 290-297, 319, 343-374, 378-381
make_prg/prg_encoder.py                   40      0   100%
make_prg/seq_utils.py                     48      4    92%   77-84
make_prg/subcommands/__init__.py           2      0   100%
make_prg/subcommands/prg_from_msa.py      43     37    14%   9-68, 72-128
--------------------------------------------------------------------
TOTAL                                    563    154    73%

in current master

to

Name                                      Stmts   Miss  Cover   Missing
-----------------------------------------------------------------------
make_prg/__init__.py                          6      2    67%   6-7
make_prg/from_msa/__init__.py                 4      0   100%
make_prg/from_msa/cluster_sequences.py      141      0   100%
make_prg/from_msa/interval_partition.py     124      2    98%   67, 132
make_prg/from_msa/prg_builder.py            120     28    77%   67-71, 158-165, 187, 211-242, 246-249
make_prg/io_utils.py                        100     79    21%   16-19, 37-40, 43-63, 68-127, 134-141, 154-179
make_prg/prg_encoder.py                      40      0   100%
make_prg/seq_utils.py                        50      4    92%   81-88
make_prg/subcommands/__init__.py              1      0   100%
-----------------------------------------------------------------------
TOTAL                                       586    115    80%

in this branch

Place msa processing specific files in their own directory
* Place clustering code in own source file
* Split it into functions
* Individually test functions
* Rename 'aligned_seq' source file to 'prg_builder', as well as main object to PrgBuilder.
* Add KMeans clustering tests including three that fail, touching on limits of current algorithm.
* Add thresholds
* Add worker functions
* Add function deciding whether a set of sequences is 'one-ref like'
  using worker functions
* Tests of all added functions
* Add cluster extraction function, with unit tests
* Use it in kmeans function
* Use one-ref criterion for clustering instead of inertia
* This solves previously introduced failing tests
* Add cluster merging function, with tests
* Introduce a maximum number of clusters (10) beyond which each
  sequence placed in own cluster
* Test it via clustering all 1024 DNA 5mers, expecting it to
  place each in own cluster without running kmeans on all possible k.
Rename msa subcommand from `prg_from_msa` to `from_msa`
* Kmeans clustering uses sequences without gaps
* One-ref clustering criterion needs sequences with gaps, to compute hamming
  distance on sequences with same length.
* Ensure the two sets of sequences are properly used for each purpose, with unit tests.
@bricoletc bricoletc requested review from rmcolq and mbhall88 November 18, 2020 10:02
Copy link
Member

@mbhall88 mbhall88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! Just some questions.

make_prg/from_msa/cluster_sequences.py Show resolved Hide resolved
make_prg/from_msa/cluster_sequences.py Show resolved Hide resolved
from make_prg.prg_encoder import PrgEncoder, PRG_Ints


def load_alignment_file(msa_file: str, alignment_format: str) -> MSA:
msa_file = str(msa_file)
logging.info("Read from MSA file %s", msa_file)
if ".gz" in msa_file:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ".gz" in msa_file:
if msa_file.endswith(".gz"):

Is probably much safer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this!

tests/from_msa/test_prg_builder.py Show resolved Hide resolved
tests/from_msa/test_prg_builder.py Show resolved Hide resolved
]
)

def test_get_subalignment_sequence_order_maintained2(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this test function is a bit ambiguous. What does 2 mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice spot- i am renaming for more clarity

* When reach MAX_CLUSTERS, use that number of clusters in output,
rather than put each sequence in its own cluster.

* Better gzip filename support

* Also rename test functions for clarity
@bricoletc bricoletc merged commit 2d7557a into master Nov 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants