-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance when calculating sequence alignment #56
Comments
Just to be thorough:
No support for arbitrary data, I would say. Technically, we align¹/measure distance between a. sequences of grapheme clusters (which is not always the same as strings) and b. sequences of words. ¹ I would consider aligning as equivalent to "editops" Technically I believe any sequence alignment implementation would do if it a. supports substitutions and b. is optimal in the sense that it gives a shortest possible alignment. If we have that I believe we could just count for the edit distance calculation. However I can imagine dropping the alignment requirement if the user chooses to do that to improve the speed of calculating the metrics, as there are implementations that can do that calculation more efficiently (speed/memory) without providing the alignemnt. In my experience it is however incredibly important to have a manual look on the alignment (= the differences report) to get a sense of the problems and this is even more important with larger texts. (As I understand it, this could be done with RapidFuzz already because of optimizations) The second thing I can imagine is considering more efficient sequence alignment algorithms, that can easily handle larger texts. This is more problematic, as this is an evaluation tool. I deliberately chose an easily auditable textbook implementation of an optimal algorithm (optimal in the sense of shortest alignment = shortest edit distance) and I would hate to introduce vagueness in the sense of "This could be the CER but we don't know it for sure because of worst case behaviour of the algorithm". But maybe there is no way around this. For now, the best short term outcome would be if we get a "reasonable" performance for real world data I've seen. @cneud reported hours of runtime and 40 GB of memory usage for a newspaper page (a lot of text!) - That is certainly not "reasonable". I'll see if the license of the data allows uploading it. (The memory usage comes from 68188 characters of GT text, assuming roughly the same size for the OCR text and 64 bits to store an element of the Levenshtein matrix gives |
Another possibility: Efficiently calculate an optimal(!) CER, but the alignment might by calculated by some more efficient algorithm (not optimal, but fast) - with a warning for the user that the CER does not necessarily correspond exactly. |
No editops requiredFor any texts where the editops are not required the Levenshtein distance can be implemented using linear memory. Performance more important than memory usageFor cases where the Levenshtein matrix is actually required it could choose the type based upon
Most PCs will not even have enough memory to ever require uint64_t. E.g. in your example this would reduce the memory requirement by 50%. memory usage more important than performanceIt would be possible to be even more aggressive by creating a specialized container, which always has submatrices using the smaller datatype. This could reduce the requirement for a 68188x68188 matrix to:
This third approach could be selected either by an argument, or when the malloc for the faster implementation fails known upper thresholdWhen an upper threshold for the Levenshtein distance is defined, this can be used to reduce the memory usage in a similar way to the approach using Editops but not whole Levenshtein matrix requiredIt is possible to remove the common prefix and suffix of two strings before calculating the Levenshtein distance. As far as I could see you are often matching big documents which only have small differences, so this could significantly reduce the memory usage (and the runtime since removing the common parts is a O(N) operation, while it is pretty much free when there is no common affix). As far as I know this is done by python-Levenshtein as well. |
@mikegerber @b2m In the Levenshtein matrix there are often multiple optimal paths, while editops will just return one. Is there any reason to prefer one path over another? The simplest way would be:
python-Levenshtein extends this slightly by preferring multiple similar operations in a row:
I guess the reason for this path is, that python-Levenshtein allows the conversion into opcodes, which always specify a range for which the operation is used -> Multiple similar operations in a row can be combined into a single range, which results in a shorter list of opcodes. Is there any reason to prefer any of the paths? |
This is now supported in the next release (implemented just not released yet). > from rapidfuzz import string_metric
> string_metric.levenshtein(["aaaa", "bbbb"], ["aaaa", "cccc"])
1 note that this operates on the hashes of the strings. So this is the same as: > from rapidfuzz import string_metric
> string_metric.levenshtein([hash("aaaa"), hash("bbbb")], [hash("aaaa"), hash("cccc")])
1 So when two strings produce the same hash value they will be considered as equivalent |
@maxbachmann I have not checked the python-Levenshtein original code, but the Using hashing to be able to compare "arbitrary" sequences sounds sensible to me. Yes there will be collisions, and there are detailed StackOverflow questions/answers going into detail on when this happens. From a Python programmers perspective I would worry about comparing sequences with object that do not implement a proper |
In which cases would this help performance. My guess was that it has something to do with these weird conversions in python-Levenshtein from I will probably add another function to get the Levenshtein matrix in a space optimized way as well using some of the methods described above.
The only alternative would be using
Note that one special case is, that I directly use the corresponding unicode code-point for strings with a length of 1, so the following works: > from rapidfuzz import string_metric
> string_metric.levenshtein("aaaa", ["a", "a", "a", "a"])
0
> string_metric.levenshtein("aaaa", [ord("a"), ord("a"), ord("a"), ord("a")])
0 |
@b2m I released v1.5.0 of rapidfuzz yesterday, which does now support all sequences of hashable objects and editops: >>> from rapidfuzz.string_metric import levenshtein_editops
>>> for tag, src_pos, dest_pos in levenshtein_editops("qabxcd", "abycdf"):
... print(("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos)))
delete s1[1] s2[0]
replace s1[4] s2[3]
insert s1[6] s2[6] |
Addition: performance tweak in the sense of determining ranges of inserts or deletes while determining the type of edit operation. The alternative would be to have a second pass through all the edit operations.
Thx for the heads up... should go to @mikegerber as he is the maintainer of this library.
I like the simplicity of the interface and the returned data! |
Awesome, I have been on vacation and I will check this out soon! |
I did some experiments with RapidFuzz and noticed a remaining problem with the positions returned by RapidFuzz's editops(). I had some similar problems with my own implementation - in fact they are still there - and so investigated it a bit. I think python-Levenshtein's positions make a lot of sense and this should be fixed; I opened an issue at rapidfuzz/RapidFuzz#148. Other than that: the new functions in RapidFuzz are exactly what we need in dinglehopper and we will use them 🥂 |
@maxbachmann was quick: PR rapidfuzz/rapidfuzz-cpp#53 fixes the problem. |
I just switched dinglehopper over to using RapidFuzz. Wow, 50x faster! I still want to do some profiling (duplicate conversions? duplicate/cacheable calls?) and more benchmarking but in essence this issue is fixed in the current version (= master). |
This is done, and with #72 (also in milestone 1.0) the remaining issue (duplicate calls) should be done, too. |
Dinglehopper is using a custom Python implementation of the Levenshtein distance to calculate, score and show an alignment of two given texts.
According to my performance analysis done for #47 the
distance
andeditops
functions of this custom implementation is the main bottleneck when comparing explicitly bad or big OCR results.In #48 I proposed to use the C based python-Levenshtein as a replacement, which we discarded for the following reasons:
One alternative and fast implementation for distance calculation is RapidFuzz, where @maxbachmann already started to adress the issue of the distance calculation for arbitrary sequences in rapidfuzz/RapidFuzz#100.
At the moment RapidFuzz is not supporting the calculcation of edit operations (see comment by @maxbachmann).
The text was updated successfully, but these errors were encountered: