
Switch from custom Levenshtein to python-Levenshtein #48

Closed · wants to merge 4 commits

Conversation

@b2m (Contributor) commented Nov 16, 2020

As the distance and editops calculation is a performance bottleneck in this application, I replaced the custom Levenshtein implementation with the C implementation from the python-Levenshtein package.

We now also have separate entrypoints for texts with and without Unicode normalization. For example, when calculating the flexible character accuracy in #47, the normalization can be done more efficiently once during preprocessing.
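
For illustration, a minimal sketch of this split between normalized and raw entrypoints; the function names here are illustrative, not dinglehopper's actual API, while `Levenshtein.distance` and `unicodedata.normalize` are the real calls:

```python
import unicodedata

import Levenshtein  # the C extension from the python-Levenshtein package


def distance_normalized(s1: str, s2: str) -> int:
    # Entrypoint with Unicode normalization: normalize once here,
    # then delegate to the fast C implementation.
    return Levenshtein.distance(
        unicodedata.normalize("NFC", s1),
        unicodedata.normalize("NFC", s2),
    )


def distance_raw(s1: str, s2: str) -> int:
    # Entrypoint without normalization, for callers that already
    # normalized their texts once during preprocessing (e.g. #47).
    return Levenshtein.distance(s1, s2)
```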

Here are some benchmarks for the behaviour before and after this change using data already available in the repository:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| before brochrnx_73075507X | 3.359 ± 0.039 | 3.305 | 3.408 | 8.29 ± 0.13 |
| after brochrnx_73075507X | 0.405 ± 0.004 | 0.399 | 0.409 | 1.00 |
| before actevedef_718448162 CALAMARI | 34.410 ± 0.561 | 33.918 | 35.362 | 84.91 ± 1.61 |
| after actevedef_718448162 CALAMARI | 0.926 ± 0.010 | 0.911 | 0.935 | 2.28 ± 0.03 |
| before actevedef_718448162 TESS | 34.103 ± 0.305 | 33.685 | 34.529 | 84.16 ± 1.11 |
| after actevedef_718448162 TESS | 0.909 ± 0.008 | 0.899 | 0.921 | 2.24 ± 0.03 |

[Boxplot generated by hyperfine]

Note that with these changes we can only compare Unicode documents with a total maximum of 1 114 111 unique characters (or words). The limit comes from Python's chr function, but I suspect this will not be an issue now or in the near future.
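
The chr-based encoding looks roughly like this. A minimal sketch of the idea: the helper name transform_lists is taken from the review discussion below, its body here is an assumption, and only `Levenshtein.distance` is the real python-Levenshtein call:

```python
import Levenshtein  # python-Levenshtein's C extension


def transform_lists(seq1, seq2):
    # Map every distinct element (e.g. a word) to a unique one-character
    # string via chr(). chr() accepts code points 0 to 1114111, which is
    # where the limit on unique elements mentioned above comes from.
    # (A robust version would also skip the surrogate range.)
    mapping = {}

    def encode(seq):
        for token in seq:
            if token not in mapping:
                mapping[token] = chr(len(mapping))
        return "".join(mapping[token] for token in seq)

    return encode(seq1), encode(seq2)


a, b = transform_lists("the quick brown fox".split(), "the brown fox".split())
assert Levenshtein.distance(a, b) == 1  # one word was deleted
```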

@mikegerber (Member) left a comment


Thanks for the contribution! To be honest this would be a lot easier to review if you hadn't refactored all of the relevant tests.

@b2m (Contributor, Author) commented Nov 17, 2020

> Thanks for the contribution! To be honest this would be a lot easier to review if you hadn't refactored all of the relevant tests.

Yes... I refactored the tests (1st commit) to be able to test the original implementation and the new implementation in parallel (2nd commit), and then replaced the original implementation (3rd commit).

I then messed it up during the second commit because I overlooked some tests.

So you can still go back to the second commit (fa617e8) and see both implementations working in parallel.

@kba (Contributor) left a comment


Learned something new (https://pypi.org/project/multimethod/), well-tested and summarized. Great PR IMHO.

@mikegerber self-assigned this Nov 17, 2020
@mikegerber (Member) commented

I had a Gitter chat with @b2m: I'll 1. cherry-pick the test refactoring and 2. see if I can come up with a Cython implementation of the Wagner-Fischer algorithm, because I am not so super happy with the proposed transform_lists. If that is not working out, I'll try to be happy with the encode-list-elements-as-strings hack ;-)

(Best solution: Add a generic solution to python-Levenshtein)
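
For reference, the Wagner-Fischer recurrence mentioned above, as a generic pure-Python sketch that works on arbitrary sequences (lists of words included); a Cython version would follow the same two-row scheme. This is textbook code, not the implementation from this PR:

```python
def wagner_fischer(a, b) -> int:
    # Edit distance by the classic dynamic program, keeping only two
    # rows of the matrix: O(len(a) * len(b)) time, O(len(b)) memory.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


assert wagner_fischer("kitten", "sitting") == 3
assert wagner_fischer("a few words".split(), "a few other words".split()) == 1
```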

@mikegerber (Member) commented

There is another remark I should make: there are other sequence matching algorithms, and I am not sure yet whether I want them in here, because they do not necessarily output the optimal solution.

@b2m (Contributor, Author) commented Nov 17, 2020

> (Best solution: Add a generic solution to python-Levenshtein)

I agree that the best solution would be a contribution to the python-Levenshtein package,
but I lack the experience in C to do this at a production-grade level 😞.

@mikegerber (Member) commented

This is on the agenda for 2021-01, sorry for not doing it earlier :) Merry Christmas!

(I actually have a naive Cython implementation which is faster than the pure Python implementation. More on that in January!)

b2m added 4 commits February 16, 2021 11:26
As the distance and editops calculation is a performance bottleneck in
this application, we replaced the custom Levenshtein implementation with
the C implementation in the python-Levenshtein package.

We now also have separate entrypoints for texts with and without Unicode
normalization, because this can also be done more efficiently once during
preprocessing.
@maxbachmann (Contributor) commented

One thing to keep in mind is that python-Levenshtein is GPL licensed. When using an external Levenshtein distance implementation, it might be better to use a library with a more fitting license. I did a performance comparison in the documentation of my own library (RapidFuzz), https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein, and AFAIK there are multiple implementations that are more permissively licensed and faster than python-Levenshtein.
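
For reference, a sketch of the RapidFuzz call, assuming a current release (the module layout changed after this comment was written, and editops support was only added later, as the thread below shows):

```python
from rapidfuzz.distance import Levenshtein

assert Levenshtein.distance("kitten", "sitting") == 3
# Unlike python-Levenshtein (GPLv2), RapidFuzz is MIT licensed;
# newer releases also provide Levenshtein.editops.
```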

@mikegerber (Member) commented Apr 12, 2021

@maxbachmann Thanks, I will definitely check it out! Does your implementation allow for arbitrary sequences, or is it limited to str? If you know of any other implementations, please feel free to tell me :) (There are a lot of sequence alignment algorithms, but I don't know exactly how they behave, so I opted for the basic "optimal" algorithm, i.e. the shortest alignment.)

@mikegerber (Member) commented

@maxbachmann To elaborate a bit:

In dinglehopper, we need to align both sequences of Unicode characters and also sequences of words. (We need to align those, too, not only compute the edit distance.)

When aligning strings (sequences of Unicode characters), we also need to address a problem that comes up especially when dealing with OCR of historic prints: characters that cannot be represented by a single code point. For example, in "Schlym̃" there is a code point sequence/grapheme cluster LATIN SMALL LETTER M + COMBINING TILDE that cannot be normalized to a single code point. This needs to be treated as one "character" when aligning/calculating, one way or another.
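
One way to segment such strings into grapheme clusters before aligning. A sketch using the third-party regex package, whose \X pattern matches an extended grapheme cluster; this is an illustration, not necessarily how dinglehopper handles it:

```python
import regex  # third-party package; the stdlib re module has no \X

word = "Schlym\u0303"  # "Schlym̃": ends in m + COMBINING TILDE
assert len(word) == 7                    # 7 code points
clusters = regex.findall(r"\X", word)
assert len(clusters) == 6                # "m̃" is one grapheme cluster
# Aligning over `clusters` (a list of strings) instead of the raw
# string treats "m̃" as a single "character".
```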

@maxbachmann (Contributor) commented Apr 13, 2021

  1. At least right now, RapidFuzz only calculates the edit distance and does not have a feature to return the editops. It mostly does not exist because I did not have the need for it yet. An implementation in RapidFuzz would probably perform pretty much exactly as well as it does in python-Levenshtein, since it simply extracts one possible path through the Levenshtein matrix.
  2. Both python-Levenshtein and RapidFuzz currently only work with strings. By contrast, https://github.com/roy-ht/editdistance calculates the edit distance between two sequences whose elements only have to be hashable, which would work with a list of words as well. I opened rapidfuzz/RapidFuzz#100 (Support distance calculation for types other than strings) to discuss a potential implementation.
  3. Similar to python-Levenshtein, RapidFuzz works with single code points. So e.g. "Schlym̃", which is represented as 7 characters in Python, will be interpreted as 7 characters by RapidFuzz as well. I guess when adding the extension from 2) you could pass the clustered version into the algorithms, which would work as expected. (Especially when you use a string multiple times, it would make sense to pass an array.array('Q', ...) so the hashes do not have to be calculated multiple times; see the sketch below.)
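
A sketch of the pre-hashing idea from point 3, using Python's built-in hash; illustrative only, since a real implementation would have to consider hash collisions mapping distinct clusters to the same value:

```python
from array import array


def cluster_hashes(clusters):
    # hash() returns a signed integer; mask it into the unsigned
    # 64-bit range so it fits array type 'Q' (unsigned 8-byte ints).
    return array("Q", (hash(c) & 0xFFFFFFFFFFFFFFFF for c in clusters))


hashes = cluster_hashes(["S", "c", "h", "l", "y", "m\u0303"])
# `hashes` can be compared element-wise many times without
# re-hashing the underlying grapheme clusters.
```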

@mikegerber (Member) commented

Alright, I decided something:

  • I am going to merge this PR in the coming days, even though I am not happy with some design decisions, e.g. encoding arbitrary data as a string. The quality of the PR is exceptionally good; it's just that design that held me up.
  • This is provided it solves some of the performance problems we observed (@cneud mentioned 2 hours processing time for one newspaper page 🐌)
  • I might do some work on this when merging
  • If @maxbachmann's RapidFuzz provides (in the future) editops and the ability to process arbitrary sequences/array('Q'), we'll switch to that

@b2m (Contributor, Author) commented Apr 23, 2021

We still have the GPL problem @maxbachmann mentioned.

Dinglehopper is licensed as Apache 2 and therefore cannot use a GPL-licensed Python module (AFAIK).

I would consider this a showstopper for this PR, and I have not found a suitable replacement that offers the same functionality (edit operations + edit distance).

@maxbachmann (Contributor) commented

> If @maxbachmann's RapidFuzz provides (in the future) editops and the ability to process arbitrary sequences/array('Q'), we'll switch to that

I am currently working on this. Will let you know when it is added.

@mikegerber (Member) commented

@b2m I'll investigate this. I hate licensing problems :-) (My last review comment is weird; it's old and I don't know how to delete it :) )

@maxbachmann Awesome!

@mikegerber (Member) commented

While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library, and we are not allowed to do that without making dinglehopper GPL, too. So the python-Levenshtein library cannot be used, unfortunately. (Using GPL tools in e.g. Makefiles seems to be fine, though, as these tools are run as separate programs.)

I'll make my team aware of this problem. @cneud

@kba This might be of interest for the OCR-D community, too.

Side note about GPLv3: as I understand this page, we also can't use GPLv3-licensed libraries in our Apache-licensed software. (It would work the other way around, but AFAICT only with GPLv3, not GPLv2.)

@mikegerber mentioned this pull request Apr 24, 2021
@maxbachmann (Contributor) commented

> While I am not a lawyer, I concluded that using the python-Levenshtein library (GPLv2) constitutes "linking" the library, and we are not allowed to do that without making dinglehopper GPL

GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the topic. Generally GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as, so it is unclear how the GPL has to be applied in Python. However, it is definitely simpler to use a library with a more permissive license, since that completely avoids these issues. As a note, FuzzyWuzzy, for example, was originally MIT licensed but became GPL licensed because they use python-Levenshtein and were unsure about this as well.

@mikegerber (Member) commented

> GPL and Python are pretty much an open topic, since I am not aware of any lawsuit on the topic. Generally GPLv2/3 applies when linking the library (both static and dynamic), but it is unclear what a Python import counts as.

I am almost convinced that importing a Python library is essentially dynamic linking, but I agree it is not 100% clear. However, I have no doubt in this case because of the shared object site-packages/Levenshtein/_levenshtein.cpython-38-x86_64-linux-gnu.so ;-)

I just checked all other (first-level) dependencies, and all are licensed under less strict licenses (MIT, Apache, BSD). So I think we will go down this route for this aspect of the software as well. (Not discussed with the team yet, but I doubt there is anyone on my team who wants to fist-fight the Free Software Foundation over this.)

@b2m (Contributor, Author) commented Apr 24, 2021

@mikegerber License checking could also become part of CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to CircleCI or GitHub Actions =)
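
To illustrate the idea, a minimal Python sketch over the installed environment using importlib.metadata; a real CI job would rather use a dedicated tool such as pip-licenses, and the substring check below is only a heuristic, not legal advice:

```python
from importlib.metadata import distributions

# Heuristic scan for GPL-licensed dependencies. Package metadata is
# often incomplete (classifiers are ignored here), so treat this as
# a first pass only.
for dist in distributions():
    license_ = dist.metadata.get("License") or "UNKNOWN"
    if "GPL" in license_ and "LGPL" not in license_:
        raise SystemExit(f"{dist.metadata['Name']} is licensed under {license_}")
```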

Should I then close this pull request and extract the necessary information from this thread into a separate ticket?

@kba (Contributor) commented Apr 25, 2021

> @mikegerber License checking could also become part of CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to CircleCI or GitHub Actions

I am fairly certain that we're not tainted by license incompatibilities anywhere, but it would be nice to be more than "fairly certain". So if you do set up something like this, let me know; it might be useful to do this as part of the CI/CD of ocrd_all.

@mikegerber (Member) commented

> @mikegerber License checking could also become part of CI/CD. If this is of interest, feel free to open an issue and I will have a look into it. I usually did this for Java and with GitLab, but I guess the knowledge is transferable to CircleCI or GitHub Actions =)

I've reopened #54 for this!

> Should I then close this pull request and extract the necessary information from this thread into a separate ticket?

Yes, that would be nice!

@b2m (Contributor, Author) commented Apr 26, 2021

Closing this pull request; I opened #56 to keep track of a more appropriate solution.

@b2m closed this Apr 26, 2021