Refactored clustering code and new kmeans clustering criterion #19

bricoletc · 2020-11-18T10:02:36Z

This PR has two main goals:

Continue the refactoring effort. This time on clustering code, placed in own source files, and added many more tests.
Address part of Ambiguous path production #15 , by reducing ambiguous paths (distinct paths in prg with same sequence). From empirical tests on TB and Pf datasets, it does reduce this problem significantly, and improves downstream genotyping (in gramtools)

One important point is I've changed the command name prg_from_msa to from_msa; this way we call make_prg from_msa. This may be annoying because will need to change scripts using make_prg if we go ahead with this. I find it more descriptive and easier to type, but has to be weighed against changing calls in existing scripts/pipelines.

Added

Unit tests for refactored kmeans clustering functions.
Add failing clustering tests: all sequences very close getting grouped into clusters, sequences with three clusters getting grouped into two clusters
New kmer clustering algorithm based on 'one-ref' principle, replacing inertia-halving criterion. Idea: if a group of sequences is similar to their majority sequence, do not need to be clustered. Similar defined via heuristic on %identity. This solves above failing cases and does not fail existing (eg integration) tests.

Modified

Refactoring of kmeans clustering: take clustering code out of main AlignedSeq object and into own source file.
Kmeans clustering split into more functions, for eg counting distinct kmers, producing kmer count matrix, extracting clusters, merging clusters.
Place prg from msa source files in own from_msa directory
Refactoring of AlignedSeq object: now called PrgBuilder; same for source file. Is clear it builds from msa as located in from_msa directory
Rename prg_from_msa subcommand to from_msa for less redundancy: now call as make_prg from_msa

Test coverage

Goes from

Name                                   Stmts   Miss  Cover   Missing
--------------------------------------------------------------------
make_prg/__init__.py                      11      2    82%   13-14
make_prg/interval_partition.py           124      2    98%   67, 132
make_prg/io_utils.py                      99     79    20%   15-18, 36-39, 42-62, 67-126, 133-140, 153-178
make_prg/make_prg_from_msa.py            196     30    85%   70-74, 202-203, 290-297, 319, 343-374, 378-381
make_prg/prg_encoder.py                   40      0   100%
make_prg/seq_utils.py                     48      4    92%   77-84
make_prg/subcommands/__init__.py           2      0   100%
make_prg/subcommands/prg_from_msa.py      43     37    14%   9-68, 72-128
--------------------------------------------------------------------
TOTAL                                    563    154    73%

in current master

to

Name                                      Stmts   Miss  Cover   Missing
-----------------------------------------------------------------------
make_prg/__init__.py                          6      2    67%   6-7
make_prg/from_msa/__init__.py                 4      0   100%
make_prg/from_msa/cluster_sequences.py      141      0   100%
make_prg/from_msa/interval_partition.py     124      2    98%   67, 132
make_prg/from_msa/prg_builder.py            120     28    77%   67-71, 158-165, 187, 211-242, 246-249
make_prg/io_utils.py                        100     79    21%   16-19, 37-40, 43-63, 68-127, 134-141, 154-179
make_prg/prg_encoder.py                      40      0   100%
make_prg/seq_utils.py                        50      4    92%   81-88
make_prg/subcommands/__init__.py              1      0   100%
-----------------------------------------------------------------------
TOTAL                                       586    115    80%

in this branch

Place msa processing specific files in their own directory

* Place clustering code in own source file * Split it into functions * Individually test functions

* Rename 'aligned_seq' source file to 'prg_builder', as well as main object to PrgBuilder. * Add KMeans clustering tests including three that fail, touching on limits of current algorithm.

* Add thresholds * Add worker functions * Add function deciding whether a set of sequences is 'one-ref like' using worker functions * Tests of all added functions

* Add cluster extraction function, with unit tests * Use it in kmeans function * Use one-ref criterion for clustering instead of inertia * This solves previously introduced failing tests

* Add cluster merging function, with tests * Introduce a maximum number of clusters (10) beyond which each sequence placed in own cluster * Test it via clustering all 1024 DNA 5mers, expecting it to place each in own cluster without running kmeans on all possible k.

Rename msa subcommand from `prg_from_msa` to `from_msa`

* Kmeans clustering uses sequences without gaps * One-ref clustering criterion needs sequences with gaps, to compute hamming distance on sequences with same length. * Ensure the two sets of sequences are properly used for each purpose, with unit tests.

mbhall88

Nice PR! Just some questions.

make_prg/from_msa/cluster_sequences.py

mbhall88 · 2020-11-19T01:26:36Z

make_prg/io_utils.py

 from make_prg.prg_encoder import PrgEncoder, PRG_Ints


 def load_alignment_file(msa_file: str, alignment_format: str) -> MSA:
+    msa_file = str(msa_file)
    logging.info("Read from MSA file %s", msa_file)
    if ".gz" in msa_file:


Suggested change

if ".gz" in msa_file:

if msa_file.endswith(".gz"):

Is probably much safer

adding this!

tests/from_msa/test_prg_builder.py

mbhall88 · 2020-11-19T01:39:57Z

tests/from_msa/test_prg_builder.py

+            ]
+        )
+
+    def test_get_subalignment_sequence_order_maintained2(self):


The name of this test function is a bit ambiguous. What does 2 mean?

nice spot- i am renaming for more clarity

* When reach MAX_CLUSTERS, use that number of clusters in output, rather than put each sequence in its own cluster. * Better gzip filename support * Also rename test functions for clarity

bricoletc added 8 commits November 18, 2020 09:42

Refactor source files

0b9d096

Place msa processing specific files in their own directory

Refactor sequence clustering

7162950

* Place clustering code in own source file * Split it into functions * Individually test functions

Refactoring + add failing clustering tests

0267c45

* Rename 'aligned_seq' source file to 'prg_builder', as well as main object to PrgBuilder. * Add KMeans clustering tests including three that fail, touching on limits of current algorithm.

One reference based clustering (1)

8d9329e

* Add thresholds * Add worker functions * Add function deciding whether a set of sequences is 'one-ref like' using worker functions * Tests of all added functions

One reference based clustering (2)

fa6f649

* Add cluster extraction function, with unit tests * Use it in kmeans function * Use one-ref criterion for clustering instead of inertia * This solves previously introduced failing tests

Rename subcommand

367b808

Rename msa subcommand from `prg_from_msa` to `from_msa`

bricoletc requested review from rmcolq and mbhall88 November 18, 2020 10:02

mbhall88 reviewed Nov 19, 2020

View reviewed changes

Bugfixes: including MAX_CLUSTERS capping

2d7557a

* When reach MAX_CLUSTERS, use that number of clusters in output, rather than put each sequence in its own cluster. * Better gzip filename support * Also rename test functions for clarity

bricoletc force-pushed the dev branch from d57b5fe to 2d7557a Compare November 22, 2020 14:36

mbhall88 approved these changes Nov 23, 2020

View reviewed changes

bricoletc merged commit 2d7557a into master Nov 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactored clustering code and new kmeans clustering criterion #19

Refactored clustering code and new kmeans clustering criterion #19

bricoletc commented Nov 18, 2020

mbhall88 left a comment

mbhall88 Nov 19, 2020

bricoletc Nov 22, 2020

mbhall88 Nov 19, 2020

bricoletc Nov 22, 2020

Refactored clustering code and new kmeans clustering criterion #19

Refactored clustering code and new kmeans clustering criterion #19

Conversation

bricoletc commented Nov 18, 2020

Added

Modified

Test coverage

mbhall88 left a comment

Choose a reason for hiding this comment

mbhall88 Nov 19, 2020

Choose a reason for hiding this comment

bricoletc Nov 22, 2020

Choose a reason for hiding this comment

mbhall88 Nov 19, 2020

Choose a reason for hiding this comment

bricoletc Nov 22, 2020

Choose a reason for hiding this comment