Use pydivsufsort.common_substrings() by default to speed up match computation #16
Merged
Conversation
... which does pretty much the same thing we were doing before, but faster (it combines the strings before computing the suffix array rather than computing two separate suffix arrays, uses an LCP array, uses Cython, etc.). See louisabraham/pydivsufsort#42. Just from some quick poking around in the benchmarking notebook, this speeds things up a lot. I think we can make it even faster by removing the match dicts entirely and directly populating the COO matrix data from the common_substrings() outputs, though.
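For reference, here is a minimal sketch of the kind of loop this enables. It assumes common_substrings() returns (position in s1, position in s2, length) tuples for each maximal common substring of length >= limit; check the pydivsufsort docs and louisabraham/pydivsufsort#42 for the exact interface.

```python
# Hedged sketch: collect all length->=k matches between two sequences with
# pydivsufsort.common_substrings(). Assumes it returns
# (position in s1, position in s2, length) tuples; verify against the
# pydivsufsort documentation before relying on this.
from pydivsufsort import common_substrings

s1 = "ACGTTACGACGTT"
s2 = "TTACGACGTTACG"
k = 5

# Every maximal common substring of length >= k
for i, j, length in common_substrings(s1, s2, limit=k):
    # A maximal match of length L contains (L - k + 1) k-mer matches,
    # one starting at each offset within the match.
    for offset in range(length - k + 1):
        print(f"s1[{i + offset}:{i + offset + k}] == s2[{j + offset}:{j + offset + k}]")
```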
I am sure it is possible to avoid creating "matches" and instead update the COO data directly, but palindromic matches make this tricky (per the comment that remains in lines 241-246 of _make.py). Hmm, I guess that since the COO data is filled in in a certain order, we could maybe use binary search or something to detect whether a match already exists...? But that will get pretty involved.
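To illustrate the palindrome problem (the names and values below, like FWD/REV/BOTH and the cell dict, are hypothetical stand-ins, not the actual _make.py code): a cell hit by both a forward and a reverse-complement match has to end up as a single "both directions" entry rather than two duplicate (row, col) entries, which is easy with a dict lookup but trickier with append-only COO arrays.

```python
# Illustrative sketch only (not the actual _make.py logic): why palindromic
# matches complicate writing COO data directly. A cell hit by both a forward
# and a reverse-complement match must end up with a single "BOTH" entry,
# not two duplicate (row, col) entries.
FWD, REV, BOTH = 1, -1, 2  # hypothetical cell values

def add_match(cells, row, col, direction):
    """Record a match, upgrading the cell to BOTH if the other strand
    already hit it. 'cells' maps (row, col) -> value."""
    prev = cells.get((row, col))
    if prev is None:
        cells[(row, col)] = direction
    elif prev != direction and prev != BOTH:
        cells[(row, col)] = BOTH

cells = {}
add_match(cells, 10, 3, FWD)
add_match(cells, 10, 3, REV)   # same cell, other strand -> upgraded to BOTH
assert cells[(10, 3)] == BOTH
```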
Why did I leave this stuff lowercase in the first place?
Moves bulky comment stuff out of the code, which is pretty simple.
Was checking pos != REV instead of md[pos] != REV... goofy. Should really add a test with a bogus "cs" object or something that verifies this protection works. I guess this will necessitate splitting up this function a bit to make it easy to test.
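Rough sketch of the kind of regression test meant here, once the cell-updating logic is factored out into a small helper (the function and constant names are hypothetical placeholders, not wotplot's actual internals):

```python
# Placeholder test sketch: exercise the "don't silently overwrite the other
# strand's cell" guard with a fabricated match input. Names are hypothetical
# stand-ins for a factored-out _make.py helper.
FWD, REV, BOTH = 1, -1, 2

def update_cell(md, pos, direction):
    """Hypothetical helper: set md[pos], marking BOTH if the other
    strand already claimed this position."""
    if pos in md and md[pos] != direction:   # the fix: check md[pos], not pos
        md[pos] = BOTH
    else:
        md[pos] = direction

def test_fwd_then_rev_marks_both():
    md = {(5, 7): REV}
    update_cell(md, (5, 7), FWD)
    assert md[(5, 7)] == BOTH

def test_fresh_cell_keeps_direction():
    md = {}
    update_cell(md, (5, 7), REV)
    assert md[(5, 7)] == REV
```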
For some reason the 100M x 100M test still fails; not sure why. I tried running the notebook in both Firefox and Chrome, which confirms that this is not a browser issue. Next up: try running the old code and see if the memory numbers look different and/or if the 100M x 100M test works.
Yeah, I think this is the best solution. Everybody wins this way.
fedarko changed the title from "Use pydivsufsort.common_substrings() to speed up match computation" to "Use pydivsufsort.common_substrings() to speed up default match computation" on Dec 28, 2024
fedarko changed the title from "Use pydivsufsort.common_substrings() to speed up default match computation" to "Use pydivsufsort.common_substrings() by default to speed up match computation" on Dec 28, 2024
The time needed to compute the E. coli dot plot matrix used in the tutorial, for example, goes from ~3 minutes to ~30 seconds.
Still need to finish updating the benchmarking notebook and README. I'd like to also add some more tests / do some more sanity checks / compare peak memory usage with the old method (I seem to be getting more crashes...? Not sure if this is just because it's been a year since the last benchmarking, or if the new method uses more memory -- in which case maybe we could keep the old method around as an option for low-memory, slower-is-okay use cases).
Update: yeah, the new method requires more memory. So the old method is still available by passing sa_only to wp.DotPlotMatrix().
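For context, a hedged usage sketch: only the sa_only option comes from this PR, and the other constructor arguments below are assumptions about wp.DotPlotMatrix()'s signature, so check the wotplot README for the actual interface.

```python
# Hedged usage sketch: fall back to the older, slower-but-leaner suffix-array
# method by passing sa_only=True. The positional arguments (s1, s2, k) are
# assumptions about the wotplot.DotPlotMatrix() signature.
import wotplot as wp

s1 = "AGCAGCTAGCTTAGC"
s2 = "GCTAGCATTAGCAGC"
k = 5

# Default: uses pydivsufsort.common_substrings() (faster, but more memory)
m_fast = wp.DotPlotMatrix(s1, s2, k)

# Low-memory option: keep the old match-finding method
m_lean = wp.DotPlotMatrix(s1, s2, k, sa_only=True)
```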