-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactored clustering code and new kmeans clustering criterion #19
Conversation
Place msa processing specific files in their own directory
* Place clustering code in own source file * Split it into functions * Individually test functions
* Rename 'aligned_seq' source file to 'prg_builder', as well as main object to PrgBuilder. * Add KMeans clustering tests including three that fail, touching on limits of current algorithm.
* Add thresholds * Add worker functions * Add function deciding whether a set of sequences is 'one-ref like' using worker functions * Tests of all added functions
* Add cluster extraction function, with unit tests * Use it in kmeans function * Use one-ref criterion for clustering instead of inertia * This solves previously introduced failing tests
* Add cluster merging function, with tests * Introduce a maximum number of clusters (10) beyond which each sequence placed in own cluster * Test it via clustering all 1024 DNA 5mers, expecting it to place each in own cluster without running kmeans on all possible k.
Rename msa subcommand from `prg_from_msa` to `from_msa`
* Kmeans clustering uses sequences without gaps * One-ref clustering criterion needs sequences with gaps, to compute hamming distance on sequences with same length. * Ensure the two sets of sequences are properly used for each purpose, with unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR! Just some questions.
make_prg/io_utils.py
Outdated
from make_prg.prg_encoder import PrgEncoder, PRG_Ints | ||
|
||
|
||
def load_alignment_file(msa_file: str, alignment_format: str) -> MSA: | ||
msa_file = str(msa_file) | ||
logging.info("Read from MSA file %s", msa_file) | ||
if ".gz" in msa_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ".gz" in msa_file: | |
if msa_file.endswith(".gz"): |
Is probably much safer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding this!
tests/from_msa/test_prg_builder.py
Outdated
] | ||
) | ||
|
||
def test_get_subalignment_sequence_order_maintained2(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this test function is a bit ambiguous. What does 2 mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice spot- i am renaming for more clarity
* When reach MAX_CLUSTERS, use that number of clusters in output, rather than put each sequence in its own cluster. * Better gzip filename support * Also rename test functions for clarity
This PR has two main goals:
One important point is I've changed the command name
prg_from_msa
tofrom_msa
; this way we callmake_prg from_msa
. This may be annoying because will need to change scripts usingmake_prg
if we go ahead with this. I find it more descriptive and easier to type, but has to be weighed against changing calls in existing scripts/pipelines.Added
Modified
AlignedSeq
object and into own source file.from_msa
directoryAlignedSeq
object: now calledPrgBuilder
; same for source file. Is clear it builds from msa as located infrom_msa
directoryprg_from_msa
subcommand tofrom_msa
for less redundancy: now call asmake_prg from_msa
Test coverage
Goes from
in current
master
to
in this branch