Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pydivsufsort.common_substrings() by default to speed up match computation #16

Merged
merged 29 commits into from
Dec 28, 2024

Conversation

fedarko
Copy link
Owner

@fedarko fedarko commented Oct 22, 2024

The time needed to compute the E. coli dot plot matrix used in the tutorial, for example, goes from ~3 minutes to ~30 seconds.

Still need to finish updating the benchmarking notebook and README. I'd like to also add some more tests / do some more sanity checks / compare peak memory usage with the old method (I seem to be getting more crashes...? Not sure if this is just because it's been a year since the last benchmarking, or if the new method uses more memory -- in which case maybe we could keep the old method around as an option for low-memory, slower-is-okay use cases).

Update: yeah, the new method requires more memory. So the old option is still available by passing sa_only to wp.DotPlotMatrix().

... which does pretty much the same thing we were doing before
but faster (combines the strings before computing the suffix array
rather than computing two separate suffix arrays; uses an LCP
array; uses Cython; etc). See
louisabraham/pydivsufsort#42.

Just from some quick poking around in the benchmarking notebook,
this speeds up things a lot. I think we can make it even faster
by removing the match dicts entirely and just directly populating
the COO matrix data from the common_substrings() outputs, though.
i am sure it is possible to avoid creating "matches" and instead
update the COO data directly, but palindrome matches make this
tricky (per the comment that remains in lines 241-246 in _make.py).
hmm - i guess since the COO data is filled in in a certain order
we could maybe use binary search or something to detect if a match
already exists ...? but that will get pretty involved
why did i leave this stuff lowercase in the first place
moves bulky comment stuff out of the code, which is p simple
was checking pos != REV instead of md[pos] != REV... goofy.

should really add a test or something with bogus "cs" object or
something that verifies that this protects things. i guess this
will necessitate splitting up this function a bit to make that
easy to test
for some reason the 100M x 100M test stil fails - not sure why.

i tried out running the notebook in both firefox and chrome, so that
confirms that this is not a browser issue. next up, try running old
code and see if the memory numbers look different and/or if the
100M x 100M test works
yeah i think this is the best solution. everybody wins this way
@fedarko fedarko changed the title Use pydivsufsort.common_substrings() to speed up match computation Use pydivsufsort.common_substrings() to speed up default match computation Dec 28, 2024
@fedarko fedarko changed the title Use pydivsufsort.common_substrings() to speed up default match computation Use pydivsufsort.common_substrings() by default to speed up match computation Dec 28, 2024
@fedarko fedarko merged commit ee548e8 into main Dec 28, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant