Use pydivsufsort.common_substrings() by default to speed up match computation #16
Merged
Conversation
... which does pretty much the same thing we were doing before, but faster (it combines the strings before computing the suffix array rather than computing two separate suffix arrays, uses an LCP array, uses Cython, etc.). See louisabraham/pydivsufsort#42. Just from some quick poking around in the benchmarking notebook, this speeds things up a lot. I think we can make it even faster by removing the match dicts entirely and directly populating the COO matrix data from the common_substrings() outputs, though.
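For reference, here is a minimal sketch of the kind of loop this enables. It assumes common_substrings() returns (position in s1, position in s2, length) tuples for each maximal common substring of length >= limit; check the pydivsufsort docs and louisabraham/pydivsufsort#42 for the exact interface.

```python
# Hedged sketch: collect all length->=k matches between two sequences with
# pydivsufsort.common_substrings(). Assumes it returns
# (position in s1, position in s2, length) tuples; verify against the
# pydivsufsort documentation before relying on this.
from pydivsufsort import common_substrings

s1 = "ACGTTACGACGTT"
s2 = "TTACGACGTTACG"
k = 5

# Every maximal common substring of length >= k
for i, j, length in common_substrings(s1, s2, limit=k):
    # A maximal match of length L contains (L - k + 1) k-mer matches,
    # one starting at each offset within the match.
    for offset in range(length - k + 1):
        print(f"s1[{i + offset}:{i + offset + k}] == s2[{j + offset}:{j + offset + k}]")
```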
I am sure it is possible to avoid creating "matches" and instead update the COO data directly, but palindromic matches make this tricky (per the comment that remains in lines 241-246 of _make.py). Hmm, I guess that since the COO data is filled in in a certain order, we could maybe use binary search or something to detect whether a match already exists...? But that will get pretty involved.
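To illustrate the palindrome problem (the names and values below, like FWD/REV/BOTH and the cell dict, are hypothetical stand-ins, not the actual _make.py code): a cell hit by both a forward and a reverse-complement match has to end up as a single "both directions" entry rather than two duplicate (row, col) entries, which is easy with a dict lookup but trickier with append-only COO arrays.

```python
# Illustrative sketch only (not the actual _make.py logic): why palindromic
# matches complicate writing COO data directly. A cell hit by both a forward
# and a reverse-complement match must end up with a single "BOTH" entry,
# not two duplicate (row, col) entries.
FWD, REV, BOTH = 1, -1, 2  # hypothetical cell values

def add_match(cells, row, col, direction):
    """Record a match, upgrading the cell to BOTH if the other strand
    already hit it. 'cells' maps (row, col) -> value."""
    prev = cells.get((row, col))
    if prev is None:
        cells[(row, col)] = direction
    elif prev != direction and prev != BOTH:
        cells[(row, col)] = BOTH

cells = {}
add_match(cells, 10, 3, FWD)
add_match(cells, 10, 3, REV)   # same cell, other strand -> upgraded to BOTH
assert cells[(10, 3)] == BOTH
```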
Why did I leave this stuff lowercase in the first place?
Moves bulky comment stuff out of the code, which is pretty simple.
Was checking pos != REV instead of md[pos] != REV... goofy. Should really add a test with a bogus "cs" object or something that verifies this protection works. I guess this will necessitate splitting up this function a bit to make it easy to test.
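Rough sketch of the kind of regression test meant here, once the cell-updating logic is factored out into a small helper (the function and constant names are hypothetical placeholders, not wotplot's actual internals):

```python
# Placeholder test sketch: exercise the "don't silently overwrite the other
# strand's cell" guard with a fabricated match input. Names are hypothetical
# stand-ins for a factored-out _make.py helper.
FWD, REV, BOTH = 1, -1, 2

def update_cell(md, pos, direction):
    """Hypothetical helper: set md[pos], marking BOTH if the other
    strand already claimed this position."""
    if pos in md and md[pos] != direction:   # the fix: check md[pos], not pos
        md[pos] = BOTH
    else:
        md[pos] = direction

def test_fwd_then_rev_marks_both():
    md = {(5, 7): REV}
    update_cell(md, (5, 7), FWD)
    assert md[(5, 7)] == BOTH

def test_fresh_cell_keeps_direction():
    md = {}
    update_cell(md, (5, 7), REV)
    assert md[(5, 7)] == REV
```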
For some reason the 100M x 100M test still fails; not sure why. I tried running the notebook in both Firefox and Chrome, which confirms that this is not a browser issue. Next up: try running the old code and see if the memory numbers look different and/or if the 100M x 100M test works.
Yeah, I think this is the best solution. Everybody wins this way.
fedarko changed the title from "Use pydivsufsort.common_substrings() to speed up match computation" to "Use pydivsufsort.common_substrings() to speed up default match computation" on Dec 28, 2024
fedarko changed the title from "Use pydivsufsort.common_substrings() to speed up default match computation" to "Use pydivsufsort.common_substrings() by default to speed up match computation" on Dec 28, 2024
The time needed to compute the E. coli dot plot matrix used in the tutorial, for example, goes from ~3 minutes to ~30 seconds.
Still need to finish updating the benchmarking notebook and README. I'd like to also add some more tests / do some more sanity checks / compare peak memory usage with the old method (I seem to be getting more crashes...? Not sure if this is just because it's been a year since the last benchmarking, or if the new method uses more memory -- in which case maybe we could keep the old method around as an option for low-memory, slower-is-okay use cases).
Update: yeah, the new method requires more memory. So the old method is still available by passing sa_only to wp.DotPlotMatrix().
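For context, a hedged usage sketch: only the sa_only option comes from this PR, and the other constructor arguments below are assumptions about wp.DotPlotMatrix()'s signature, so check the wotplot README for the actual interface.

```python
# Hedged usage sketch: fall back to the older, slower-but-leaner suffix-array
# method by passing sa_only=True. The positional arguments (s1, s2, k) are
# assumptions about the wotplot.DotPlotMatrix() signature.
import wotplot as wp

s1 = "AGCAGCTAGCTTAGC"
s2 = "GCTAGCATTAGCAGC"
k = 5

# Default: uses pydivsufsort.common_substrings() (faster, but more memory)
m_fast = wp.DotPlotMatrix(s1, s2, k)

# Low-memory option: keep the old match-finding method
m_lean = wp.DotPlotMatrix(s1, s2, k, sa_only=True)
```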