Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: inconsistent results between LicenseCompareHelper.isTextStandardLicense(), LicenseCompareHelper.isStandardLicenseWithinText(), and LicenseCompareHelper.matchingStandardLicenseIdsWithinText() #182

Closed
pmonks opened this issue Jun 28, 2023 · 22 comments

Comments

@pmonks
Copy link
Collaborator

pmonks commented Jun 28, 2023

When tested with this BSD-2-Clause text, and the BSD-2-Clause listed license object where appropriate, these three methods in org.spdx.utility.compare.LicenseCompareHelper return inconsistent results (emoji represent whether this result is expected or not):

  • !isTextStandardLicense().isDifferenceFound() returns true
  • isStandardLicenseWithinText() returns false
  • matchingStandardLicenseIdsWithinText() returns an empty collection ❌
  • matchingStandardLicenseIds() returns ["BSD-2-Clause"]

This was observed with v1.1.7 of the library, on JVM OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing). This worked correctly in previous version (at least in v1.1.5 - not sure about v1.1.6).

@pmonks pmonks changed the title Inconsistent results between LicenseCompareHelper.isTextStandardLicense(), LicenseCompareHelper.isStandardLicenseWithinText(), and LicenseCompareHelper.matchingStandardLicenseIdsWithinText() Regression: inconsistent results between LicenseCompareHelper.isTextStandardLicense(), LicenseCompareHelper.isStandardLicenseWithinText(), and LicenseCompareHelper.matchingStandardLicenseIdsWithinText() Jun 29, 2023
@pmonks
Copy link
Collaborator Author

pmonks commented Jun 29, 2023

Update: after re-testing with v1.1.5 and v1.1.6 this appears to be caused by a change to the license list XML for BSD-2-Clause in v3.21 of the license list, probably this commit. That said, it still appears to be a bug in Spdx-Java-library, as my understanding is that these methods should always behave consistently with one another when given the same (singleton) license text, regardless of how the license list XML changes.

[edit] this may be a furphy, given that the same inconsistency seems to be happening with other licenses (see next comment).

@pmonks
Copy link
Collaborator Author

pmonks commented Jul 10, 2023

I'm also seeing the same inconsistencies between the outputs of these methods when testing with the official WTFPL license text:

  • !isTextStandardLicense().isDifferenceFound() returns true
  • isStandardLicenseWithinText() returns false
  • matchingStandardLicenseIdsWithinText() returns an empty collection ❌
  • matchingStandardLicenseIds() returns ["WTFPL"]

@goneall
Copy link
Member

goneall commented Jul 10, 2023

@pmonks Thanks for tracking this down - agree it is an issue.

Just FYI - I'm going to be working on upgrading the library to the 3.0 spec over the next couple weeks, so it may be a while before I can work on this issue.

@pmonks
Copy link
Collaborator Author

pmonks commented Jul 10, 2023

Yep no worries - no hurry on my end. Just reporting things as I see them!

@pmonks
Copy link
Collaborator Author

pmonks commented Aug 2, 2023

Here's an MIT license text that's also triggering this exact same pattern of behaviour (some methods work as expected, others don't).

@goneall
Copy link
Member

goneall commented Dec 4, 2023

I've done some analysis and isolated the issue to the findTemplateWithinText(String text, String template) method.

Below is the Regex generated from the license template:

.?{0,5000}\QRedistribution\E\s*\Qand\E\s*\Quse\E\s*\Qin\E\s*\Qsource\E\s*\Qand\E\s*\Qbinary\E\s*\Qforms,\E\s*\Qwith\E\s*\Qor\E\s*\Qwithout\E\s*\Qmodification,\E\s*\Qare\E\s*\Qpermitted\E\s*\Qprovided\E\s*\Qthat\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qconditions\E\s*\Qare\E\s*\Qmet:\E\s*.{0,20}\QRedistributions\E\s*\Qof\E\s*\Qsource\E\s*\Qcode\E\s*\Qmust\E\s*\Qretain\E\s*\Qthe\E\s*\Qabove\E\s*\Q-c-\E\s*\Qnotice,\E\s*\Qthis\E\s*\Qlist\E\s*\Qof\E\s*\Qconditions\E\s*\Qand\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qdisclaimer.\E\s*.{0,20}\QRedistributions\E\s*\Qin\E\s*\Qbinary\E\s*\Qform\E\s*\Qmust\E\s*\Qreproduce\E\s*\Qthe\E\s*\Qabove\E\s*\Q-c-\E\s*\Qnotice,\E\s*\Qthis\E\s*\Qlist\E\s*\Qof\E\s*\Qconditions\E\s*\Qand\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qdisclaimer\E\s*\Qin\E\s*\Qthe\E\s*\Qdocumentation\E\s*\Qand/or\E\s*\Qother\E\s*\Qmaterials\E\s*\Qprovided\E\s*\Qwith\E\s*\Qthe\E\s*\Qdistribution.\E\s*\QTHIS\E\s*.{0,5}\QIS\E\s*\QPROVIDED\E\s*\QBY\E\s*.{1,800}\Q"AS\E\s*\QIS"\E\s*\QAND\E\s*\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*\QPURPOSE\E\s*\QARE\E\s*\QDISCLAIMED.\E\s*\QIN\E\s*\QNO\E\s*\QEVENT\E\s*\QSHALL\E\s*.+\QBE\E\s*\QLIABLE\E\s*\QFOR\E\s*\QANY\E\s*\QDIRECT,\E\s*\QINDIRECT,\E\s*\QINCIDENTAL,\E\s*\QSPECIAL,\E\s*\QEXEMPLARY,\E\s*\QOR\E\s*\QCONSEQUENTIAL\E\s*\QDAMAGES\E\s*\Q(INCLUDING,\E\s*\QBUT\E\s*\QNOT\E\s*\QLIMITED\E\s*\QTO,\E\s*\QPROCUREMENT\E\s*\QOF\E\s*\QSUBSTITUTE\E\s*\QGOODS\E\s*\QOR\E\s*\QSERVICES;\E\s*\QLOSS\E\s*\QOF\E\s*\QUSE,\E\s*\QDATA,\E\s*\QOR\E\s*\QPROFITS;\E\s*\QOR\E\s*\QBUSINESS\E\s*\QINTERRUPTION)\E\s*\QHOWEVER\E\s*\QCAUSED\E\s*\QAND\E\s*\QON\E\s*\QANY\E\s*\QTHEORY\E\s*\QOF\E\s*\QLIABILITY,\E\s*\QWHETHER\E\s*\QIN\E\s*\QCONTRACT,\E\s*\QSTRICT\E\s*\QLIABILITY,\E\s*\QOR\E\s*\QTORT\E\s*\Q(INCLUDING\E\s*\QNEGLIGENCE\E\s*\QOR\E\s*\QOTHERWISE)\E\s*\QARISING\E\s*\QIN\E\s*\QANY\E\s*\QWAY\E\s*\QOUT\E\s*\QOF\E\s*\QTHE\E\s*\QUSE\E\s*\QOF\E\s*\QTHIS\E\s*\Q,\E\s*\QEVEN\E\s*\QIF\E\s*\QADVISED\E\s*\QOF\E\s*\QTHE\E\s*\QPOSSIBILITY\E\s*\QOF\E\s*\QSUCH\E\s*\QDAMAGE.\E\s*

And below is the normalized text that should match the above regex:

-c- -c- 2015, Atlassian Pty Ltd
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

 Redistributions of source code must retain the above -c- notice, this
  list of conditions and the following disclaimer.

 Redistributions in binary form must reproduce the above -c- notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE -c- HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE -c- HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Next step would be to analyze why the regex doesn't match.

@pmonks
Copy link
Collaborator Author

pmonks commented Dec 4, 2023

@goneall thanks for pulling those out! Couple of things I found:

  • The .?{0,5000} regex fragment at the start doesn't look correct - it should only have either ? (match zero or one character) or {0,5000} (match zero to 5000 characters), but not both. I believe .{0,5000} is what we need here.
  • . in a Java regex won't match newlines by default, so .{0,5000} still won't match the multi-line Atlassian copyright header unless DOTALL mode is enabled. IME doing this using inline directives is better than doing it programmatically (it makes regexes easier to troubleshoot in isolation), so the fix here would be to start the regex with (?s).{0,5000}
  • <optional> tags in the template are handled incorrectly / inconsistently:
    • the first one in the template is turned into the regex fragment .{0,5} (around index 909 of the regex posted above), which isn't long enough for the word that appears there by default in the license text (SOFTWARE, length of 8).
    • the second one in the template doesn't appear in the regex at all - the default text is removed but there's no regex fragment in its place (around index 1891 of the regex)

I can't see any other obvious problems, though manually fixing the regex for each of these issues still didn't match the Atlassian BSD-2-Clause text, so I'm not certain that I've found everything. My suggestion would be to fix these identified issues, then start troubleshooting again with the improved regex if it's still not matching.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

@pmonks - thanks for the analysis.

DOTALL and CASE_INSENSITIVE are added in the Java pattern. Since we are getting patterns passed in from the template and most (if not all) of the regexes for the alt element for the license list XML's don't specify matching new lines, this is overridden in the Java code itself.

There is some logic to change from greedy to non-greedy regexes - I think that's where the error may be. I'll do a bit more research.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

I did a bit more research - found where we're adding the .? when it should be just a ?. Unfortunately, it is still hitting the same issue.

This is probably an issue whenever we have optional or variable text at the beginning of the license template.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

Narrowed down the issue.

The template THIS<<beginOptional>> SOFTWARE<<endOptional>> is being encoded in the regex as \s*\QTHIS\E\s*.{0,5}\QIS\E\s*. Note that there is no regex to accept the SOFTWARE token even though it is optional.

It looks like the code only allowing optional text to be 5 characters long.

What we should do is capture the optional text, tokenize it and make it a character group as optional.

An easier solution would be to use the actual length of the optional text rather than the hard coded length of 5.

Both of these are a bit of a design change.

A short term fix would be to just increase the length of the optional text to something more reasonable.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

After changing the character limit to 50, I ran into a second part of the match which failed:

\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*

isn't matching:

ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR

which has me a bit baffled.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

Figured it out - needed to capture the newlines - changing .{0,36000} to (.|\n){0,36000} works.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

The above change caused a stack overflow in the regex engine when comparing the entire text - so it isn't a solution. Since we're using DOTALL, I'm not sure why we need the (.|\n) pattern.

Not quite sure where to go from here - other than replacing the regex with a hand build parser.

@sdheh
Copy link

sdheh commented Dec 6, 2023

What if you only used the regex to find only the beginning of the non-optional license text and ran LicenseCompareHelper.isTextStandardLicense on the rest of the text starting from there? If there is no difference or DifferenceDescription.getDifferenceMessage() starts with "Additional text found after the end of the expected license text", then there is a match for the license.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

What if you only used the regex to find only the beginning of the non-optional license text and ran LicenseCompareHelper.isTextStandardLicense on the rest of the text starting from there? If there is no difference or DifferenceDescription.getDifferenceMessage() starts with "Additional text found after the end of the expected license text", then there is a match for the license.

Thanks @sdheh - Sounds like a good idea. I'll try it and see if it works.

@goneall
Copy link
Member

goneall commented Dec 6, 2023

The checking for the difference message didn't work - I was getting inconsistent message.

However, I figured out a different approach that did work.

I created 2 different patterns - one for the start and one for the end and just ran the 2 different regexes to get the beginning and end of the license text.

I'll work on a pull request after testing out other scenarios.

@sdheh
Copy link

sdheh commented Dec 7, 2023

The checking for the difference message didn't work - I was getting inconsistent message.

Would you please tell me of some cases where it didn't work? I tried doing this myself and didn't run into any problems.

@goneall
Copy link
Member

goneall commented Dec 7, 2023

Would you please tell me of some cases where it didn't work? I tried doing this myself and didn't run into any problems.

I didn't keep track of all the details, but it was one of the unit tests I wrote for the recent matching issues. The error message that came back wasn't not the expected "Additional text found ..." but a different message. Possibly related to a variable or optional.

Since the second approach worked, I didn't investigate further.

@pmonks
Copy link
Collaborator Author

pmonks commented Dec 7, 2023

@goneall will this approach (having two separate "before" and "after" regexes) work if a template has many <<optional>> elements? From memory there are quite a few license templates in this category.

BTW I'm also stumped as to why .{0,36000} would not match when (.|\n){0,36000} does, provided DOTALL mode is enabled. In fact when I test this using inline directives it does work, which leads me to suspect that DOTALL mode is not being enabled programmatically in some cases (another argument in favour of inline directives).

Here's how I tested (note: Clojure code, but it boils down to JVM byte codes / Java regexes under the covers):

user=> (def re #"(?i)(?s)\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*")
#'user/re
user=> (def s "ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR")
#'user/s
user=> (re-matches re s)
["ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR" nil]
user=> (re-matches re "This does not match, just to show what that would look like.")
nil

@goneall
Copy link
Member

goneall commented Dec 8, 2023

will this approach (having two separate "before" and "after" regexes) work if a template has many <> elements?

The optional text is replaced by a match all regex .{0,xxx} where xxx is a limit. There is a separate issue where xxx is being set too small causing some licenses not to match. I'm working on this.

If you have a license that may not work, I can add it to the unit tests just to be sure.

BTW I'm too stumped as to why .{0,36000} would not match when (.|\n){0,36000} does, provided DOTALL mode is enabled.

Me too. here's the code that sets the DOTALL:
https://github.com/spdx/Spdx-Java-Library/blob/df468223821d6ee879288094a986f6df7bf49be9/src/main/java/org/spdx/utility/compare/LicenseCompareHelper.java#L796C3-L796C3

It doesn't look like there is any other code path that would not set DOTALL.

Let me know if you see anything wrong.

What is also strange is that the DOTALL works for other patterns that span lines.

@pmonks
Copy link
Collaborator Author

pmonks commented Dec 8, 2023

Me too. here's the code that sets the DOTALL:
df46822/src/main/java/org/spdx/utility/compare/LicenseCompareHelper.java#L796C3-L796C3

Bizarre. I did a search for regex compilation throughout the code, just to see if there might be an alternative code path that's being used instead of that one, but I don't see anything obvious from that (regexes are compiled elsewhere, but they don't seem to be used in template matching).

@goneall
Copy link
Member

goneall commented Dec 13, 2023

This is now fixed with PR #221

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants