Switch from custom Levenshtein to python-Levenshtein #48
Conversation
Thanks for the contribution! To be honest, this would be a lot easier to review if you hadn't refactored all of the relevant tests.
Yes... I refactored the tests (1st commit) to be able to test the original implementation and the new implementation in parallel (2nd commit) and then replace the original implementation (3rd commit). I then messed it up during the second commit because I overlooked some tests. So you can still go back to the second commit (fa617e8) and see both implementations work in parallel.
Learned something new (https://pypi.org/project/multimethod/), well-tested and summarized. Great PR IMHO.
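For readers who, like me, hadn't seen it before: `multimethod` dispatches on the argument type annotations, which is what allows one function name to serve both strings and word sequences. A minimal sketch of the dispatch mechanism (the function name here is illustrative, not dinglehopper's actual API):

```python
from multimethod import multimethod

@multimethod
def describe(seq: str) -> str:
    # Chosen when called with a string (a sequence of characters).
    return f"string of {len(seq)} characters"

@multimethod
def describe(seq: list) -> str:
    # Chosen when called with a list, e.g. a sequence of words.
    return f"sequence of {len(seq)} words"

print(describe("Schlange"))             # string of 8 characters
print(describe(["eine", "Schlange"]))   # sequence of 2 words
```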
I had a Gitter chat with @b2m: I'll 1. cherry-pick the test refactoring and 2. see if I can come up with a Cython implementation of the Wagner–Fischer algorithm, because I am not so super happy with the proposed one. (Best solution: add a generic solution to python-Levenshtein.)
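For reference, here is a plain-Python sketch of the textbook Wagner–Fischer dynamic program (the part a Cython port would speed up); this is just the standard algorithm, not the proposed implementation:

```python
def wagner_fischer(a, b):
    """Textbook Wagner–Fischer: edit distance via a full DP matrix."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                       # i deletions
    for j in range(1, n + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

assert wagner_fischer("kitten", "sitting") == 3
```

It works on any sequence type (strings, word lists), which is why it is a natural candidate for the generic case.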
There is another remark I should make: There are other sequence matching algorithms and I am not sure yet if I want them in here because they do not necessarily output the optimal solution.
I agree that the best solution would be a contribution to the python-Levenshtein package.
This is on the agenda for 2021-01, sorry for not doing it earlier :) Merry Christmas! (I actually have a naive Cython implementation which is faster than the pure Python implementation. More on that in January!)
As the distance and editops calculation is a performance bottleneck in this application, we replaced the custom Levenshtein implementation with the C implementation in the python-Levenshtein package. We now also have separate entrypoints for texts with and without Unicode normalization, because normalization can be done more efficiently once during preprocessing.
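To illustrate the preprocessing idea with a sketch (NFC is an assumption here; the entrypoint names and the normalization form dinglehopper actually uses may differ): normalize both inputs once up front, so the distance calculation itself can skip normalization entirely.

```python
import unicodedata

def preprocess(text: str) -> str:
    # Normalize once during preprocessing instead of inside every comparison.
    return unicodedata.normalize("NFC", text)

gt = preprocess("Cafe\u0301")   # "Café" with a combining accent
ocr = preprocess("Caf\u00e9")   # "Café" as a precomposed character
assert gt == ocr                # equal after one-time normalization
```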
One thing to keep in mind is that python-Levenshtein is GPL-licensed. When using an external Levenshtein distance implementation, it might be better to use a different library with a more fitting license. I did a performance comparison in the documentation of my own library (RapidFuzz): https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein and AFAIK there are multiple implementations that are more permissively licensed and faster than python-Levenshtein.
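For context, RapidFuzz exposes both the distance and the edit operations under an MIT license. A minimal sketch, assuming a reasonably recent RapidFuzz version (≥ 2.x; older versions exposed this under `rapidfuzz.string_metric`, as the documentation link above suggests):

```python
from rapidfuzz.distance import Levenshtein

a, b = "Schlangen", "Stangen"
print(Levenshtein.distance(a, b))     # plain edit distance
for op in Levenshtein.editops(a, b):  # insert/delete/replace operations
    print(op)
```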
@maxbachmann Thanks, I will definitely check it out! Does your implementation allow for arbitrary sequences or is it limited to
@maxbachmann To elaborate a bit: In dinglehopper, we need to align both sequences of Unicode characters and sequences of words. (We need to align those, too, not only compute the edit distance.) When aligning strings (sequences of Unicode characters) we also need to address a problem that comes up especially when dealing with OCR from historic prints: characters that cannot be represented with a single code point. For example, in "Schlym̃" there is a code point sequence/grapheme cluster LATIN SMALL LETTER M + COMBINING TILDE that cannot be normalized to a single code point. This needs to be treated as one "character" when aligning or calculating, in one way or another.
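One way to get that behaviour (a sketch of the general technique, not necessarily what dinglehopper does) is to segment the text into grapheme clusters first, e.g. with the third-party `regex` module's `\X`, and align the resulting list instead of the raw string:

```python
import regex  # third-party module; supports \X for grapheme clusters

text = "Schlym\u0303"  # LATIN SMALL LETTER M + COMBINING TILDE at the end
clusters = regex.findall(r"\X", text)
print(clusters)                  # ['S', 'c', 'h', 'l', 'y', 'm̃']
print(len(text), len(clusters))  # 7 code points, but 6 "characters"
```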
Alright, I decided something:
We still have the GPL problem @maxbachmann mentioned. Dinglehopper is licensed under Apache 2 and therefore cannot use a GPL-licensed Python module (AFAIK). I would consider it a showstopper for this PR, and I have not found a suitable replacement that has the same functionality (edit operations + edit distance).
I am currently working on this. Will let you know when it is added.
@b2m I'll investigate this. I hate licensing problems :-) (My last review comment is weird, it's old and I don't know how to delete it :) ) @maxbachmann Awesome!
While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library, and we are not allowed to do that without making dinglehopper GPL, too. So the python-Levenshtein library cannot be used, unfortunately. (Using GPL tools in e.g. Makefiles seems to be fine, though, as these tools are used as separate programs.) I'll make my team aware of this problem. @cneud @kba This might be of interest for the OCR-D community, too. Side note about GPLv3: As I understand this page, we also can't use GPLv3-licensed libraries in our Apache-licensed software. (It would work the other way around, but AFAICT only with GPLv3, not GPLv2.)
GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the matter. Generally GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as, so it is unclear how the GPL has to be applied in Python. However, it is definitely simpler to use a library with a more permissive license, since that completely avoids these issues. As a note: e.g. FuzzyWuzzy was originally MIT-licensed, but became GPL-licensed because they use python-Levenshtein and were unsure about this as well.
I am almost convinced that importing a Python library is essentially dynamic linking, but I agree it is not 100% clear. However, I have no doubt in this case. I just checked all other (first-level) dependencies, and all are licensed under less strict licenses (MIT, Apache, BSD). So I think we will go down this route for this aspect of the software as well. (Not discussed with the team yet, but I doubt there is anyone in my team who wants to fist-fight the Free Software Foundation over this.)
@mikegerber License checking could also become part of CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to CircleCI or GitHub (Actions) =) Should I then close this pull request and extract the necessary information from this thread into a separate ticket?
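As a rough illustration of what such a check could do (a hand-rolled sketch using only the standard library; dedicated tools exist, and nothing below comes from this thread): walk the installed distributions and flag anything whose metadata mentions the GPL.

```python
from importlib.metadata import distributions

DISALLOWED = ("GPL",)  # crude substring match; it also hits LGPL, so review hits manually

for dist in distributions():
    name = dist.metadata["Name"]
    license_field = dist.metadata.get("License") or ""
    classifiers = " ".join(dist.metadata.get_all("Classifier") or [])
    if any(tag in license_field or tag in classifiers for tag in DISALLOWED):
        print(f"check manually: {name} ({license_field or 'see classifiers'})")
```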
I am fairly certain that we're not tainted by license incompatibilities anywhere, but it would be nice to be more than "fairly certain". So if you do set up something like this, let me know; it might be useful to do this as part of the CI/CD of ocrd_all.
I've reopened #54 for this!
Yes, that would be nice!
Closing this pull request; I opened #56 to keep track of a more appropriate solution.
As the `distance` and `editops` calculation is a performance bottleneck in this application, I substituted the custom Levenshtein implementation with the C implementation in the python-Levenshtein package. We now also have separate entrypoints for texts with Unicode normalization and without. For example, when calculating the flexible character accuracy in #47, the normalization can be done more efficiently once during preprocessing.
Here are some benchmarks for the behaviour before and after this change using data already available in the repository:
Note that with these changes we can only compare Unicode documents with a total maximum of 1,114,111 unique characters (or words). The limit comes from Python's `chr` function. But I suspect this will not be an issue now or in the near future.
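To make the origin of that limit concrete, here is a sketch of the underlying idea (not necessarily the exact code in this PR): map each unique item, e.g. each word, to a unique one-character string via `chr`, so the string-only C implementation can align arbitrary sequences. Python's `chr` accepts arguments up to `sys.maxunicode` (1,114,111), which is where the ceiling comes from.

```python
def encode_as_chars(*sequences):
    """Map each unique item to a unique one-character string via chr()."""
    mapping = {}
    encoded = []
    for seq in sequences:
        encoded.append(
            "".join(mapping.setdefault(item, chr(len(mapping))) for item in seq)
        )
    return encoded

gt_words = ["eine", "Schlange", "beißt"]
ocr_words = ["eine", "Schlange", "heißt"]
s1, s2 = encode_as_chars(gt_words, ocr_words)
# s1 and s2 are now plain strings that a C string-distance implementation
# can handle; editop indices map back 1:1 to word positions.
```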