Skip to content

Commit

Permalink
Fix URL parsing for the extended tag view
Browse files Browse the repository at this point in the history
  • Loading branch information
michaelherger committed Dec 16, 2024
1 parent c860f34 commit b659c0d
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions Slim/Web/Template/SkinManager.pm
Original file line number Diff line number Diff line change
Expand Up @@ -263,9 +263,9 @@ sub _parseURIs {

return $text unless $text;

if (!($text =~ s!\b(https?://[A-Za-z0-9\-_\.\!~*'();/?:@&=+$,]+)!<a href=\"$1\" target=\"_blank\" class="link">$1</a>!igo)) {
if (!($text =~ s!\b(https?://[A-Za-z0-9\-_\.\!~*'();/?:@%&=+$,]+)!<a href=\"$1\" target=\"_blank\" class="link">$1</a>!igo)) {
# handle emusic-type urls which don't have http://
$text =~ s!\b(www\.[A-Za-z0-9\-_\.\!~*'();/?:@&=+$,]+)!<a href=\"http://$1\" target=\"_blank\">$1</a>!igo;
$text =~ s!\b(www\.[A-Za-z0-9\-_\.\!~*'();/?:@%&=+$,]+)!<a href=\"http://$1\" target=\"_blank\">$1</a>!igo;
}

return $text;
Expand Down

9 comments on commit b659c0d

@mikecappella
Copy link
Contributor

@mikecappella mikecappella commented on b659c0d Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you intending to interpolate $, (i.e. $OFS) into the character class? If not, then the RE isn't correct (it won't catch ,) and it is much slower than it needs to be. And when $, warnings are enabled:

Use of uninitialized value $, in regexp compilation at /Applications/Lyrion Music Server.app/Contents/Resources/server/Slim/Web/Template/SkinManager.pm line 268.

If you are not intending interpolation:

        # avoid accidental $ special var interpolation in the character class
        $text and (
            $text =~ s#\b(https?://[A-Za-z0-9_.!~*();/?:@%&=+,'\-\$]+)#<a href=\"$1\" target=\"_blank\" class="link">$1</a>#igo or
            # handle emusic-type urls which don't have http://
            $text =~     s#\b(www\.[A-Za-z0-9_.!~*();/?:@%&=+,'\-\$]+)#<a href=\"http://$1\" target=\"_blank\">$1</a>#igo
        );

        return $text;

Also note that . doesn't need escapement in a character class. Escapement of - can be avoided too, if it is the first char in the class, but I left it with the other punct at the end.

Finally, the g flag was curious to me given the two substitutions, in that the code can't catch the two types of URIs for a given string.

@mikecappella
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I edited my reply above.

@michaelherger
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you tell me what would make that big performance difference? It's not an easy piece of code to read... And how did you test?

And while we're at it: wouldn't we want to have the hash tag in there, too? It can define an anger on a linked resource.

BTW: with the i modifier we'd not even have to define both upper and lower case a-z in the class.

@mikecappella
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore my mention of performance. I'd used single quote substitution delimiters to avoid interpolation. That was the big speed win. But of course that disables the replacement side too! It was a last minute change and when I ran my Benchmark::cmpthese() tests, I didn't notice the error in initial output of my lengthy test URLs.

Yes, of course the i modifier, so it would be good to remove either A-Z or a-z in the char class.

Technically, that RE for parsing a URI isn't correct. The host part cannot contain #, ? or /. And the path part cannot contain ? or #. But I surmised you just wanted something quick and dirty that mostly worked. Your call on how well this RE needs to work.

See: Parsing a URI Reference with a Regular Expression in RFC 3986.

https://www.rfc-editor.org/rfc/rfc3986#appendix-B

Given that the value may have multiple URIs, anywhere, with no well-defined delimiters, you can't use that RE since once you remove the beginning of line anchor it will degenerate and greedily also match the end of line empty string, so the global substitution would have to be replaced with a looping match so that you could ignore an empty string match. I didn't test or benchmark that option (but did test a global var qr/ / version of both REs; that turned out to be surprisingly much slower that the /o flag in the string version).

(Odd aside: GitHub seems to have enabled some new Commit feature preview for me since a few hours ago when I posted the comment above. With it enabled, I see comments when I'm logged out, but once I log in, they stop loading, both Safari and Firefox.).

@michaelherger
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this really was a q'n'd hack to get something working, now tweaking. But I don't expect it to cover multiple URLs etc. I'd consider those corner cases. It's been good enough for the first several months 😁. So... what do you think should be fixed?

BTW: I wouldn't consider performance an issue really, as this regex would only be used once or twice per page rendering.

And yes about the latest Github changes. I've had similar remarks from other users, too. I myself only saw that kind of in-line/floating comment once. Ever since I've had old school comments only again.

@mikecappella
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a list context global match RE that I want to test more carefully tomorrow. It should allow eliminating the second RE, replaced by testing for absence of protocol and presence of 'www' for the emusic-type.

I didn't feel I understood the range of value possibilities to know how loose or tight the RE had to be; your reply helps.

@mikecappella
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, perhaps you can try this out:

sub _parseURIs {
        my ($text) = @_;

        sub make_a {
            my ($url, $prot) = @_;
            return $prot ?
                qq(<a href="$url" target="_blank" class="link">$url</a>) :
                qq(<a href="http://$url" target="_blank">$url</a>); # emusic-type urls which don't have http://
        }

        $text and
            $text =~ s'((?:(?:(https?)://.)?(?:[-a-z\d.]{1,256}\.[a-z]{2,6}))\b(?:[-a-z\d_.!~*\'();/?:@%&#=+,$]*))'make_a($1, $2)'igoe;

        return $text;
}

This isn't strict, and it can't be given the freedom of the data input. Here's a test string I used:

https://example.com
https://example.com/
sample.com/path
www.emusic.com/path
ftp://someftp
must pass this https testhttps://example.com?page=1#anchor1&param2#anchor2"blah foo bar www.emusic.com/foo/bar#anchor.   https://musicbrainz.org/release/7d610a51-c943-4750-b96e-b8fb54d15831?tport=8000 https://www.discogs.com/release/5210482-Beyonc%C3%A9-Beyonc%C3%A9
https://www.567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123.us/path https://www.discogs.com/release/2036165-Tcha%C3%AFkovsky-Mendelssohn-Arthur-Grumiaux-Violin-Concertos

and the output:

<a href="https://example.com" target="_blank" class="link">https://example.com</a>
<a href="https://example.com/" target="_blank" class="link">https://example.com/</a>
<a href="http://sample.com/path" target="_blank">sample.com/path</a>
<a href="http://www.emusic.com/path" target="_blank">www.emusic.com/path</a>
ftp://someftp
must pass this https test<a href="https://example.com?page=1#anchor1&param2#anchor2" target="_blank" class="link">https://example.com?page=1#anchor1&param2#anchor2</a>"blah foo bar <a href="http://www.emusic.com/foo/bar#anchor." target="_blank">www.emusic.com/foo/bar#anchor.</a>   <a href="https://musicbrainz.org/release/7d610a51-c943-4750-b96e-b8fb54d15831?tport=8000" target="_blank" class="link">https://musicbrainz.org/release/7d610a51-c943-4750-b96e-b8fb54d15831?tport=8000</a> <a href="https://www.discogs.com/release/5210482-Beyonc%C3%A9-Beyonc%C3%A9" target="_blank" class="link">https://www.discogs.com/release/5210482-Beyonc%C3%A9-Beyonc%C3%A9</a>
<a href="https://www.567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123.us/path" target="_blank" class="link">https://www.567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123.us/path</a> <a href="https://www.discogs.com/release/2036165-Tcha%C3%AFkovsky-Mendelssohn-Arthur-Grumiaux-Violin-Concertos" target="_blank" class="link">https://www.discogs.com/release/2036165-Tcha%C3%AFkovsky-Mendelssohn-Arthur-Grumiaux-Violin-Concertos</a>

It probably isn't desirable to create a link from the sample.com/path value, but again, the data input is too undefined. And the existing code would do likewise when the data didn't contain any other URLs starting with https?.

After I increased the 256 byte limit in Slim::Menu::TrackInfo::tagDump(), the appropriate links were clickable in the tag dump:

image

@michaelherger
Copy link
Member Author

@michaelherger michaelherger commented on b659c0d Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's even less human readable 😂. I first read it on the phone...

What about the o modifier? perlre says:

o - pretend to optimize your code, but actually introduce bugs

Sounds promising 😁. We can live without that, can't we? Could you please submit a PR against 9.1 (let's be cautious...)?

Would you know whether that regex was working in JavaScript, too? Material is parsing on the client.

@mikecappella
Copy link
Contributor

@mikecappella mikecappella commented on b659c0d Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ugly, I get it. The ambiguity of the data presents the challenge, the RE trying to do what the original code purported to do. But its effort was later thwarted by the 256 char limit (does it need to be so low?).

If the need is to only match a single... the first... URL in a value, the original code with $ properly escaped and # added is good enough; it can't be perfect. Let me know if I should still create the 9.1 PR (I fear your plate hath runneth over).

The /o modifer is safe and useful if there's no interpolation in the pattern. In much older Perl versions, it could have some issues, and there were bugs in the Perl code optimizer. It was also added to far too many REs without thought.

In the original code, if warnings were enabled, Perl would have emitted one warning (as it interpolated $, during RE compilation). Remove the \o, and Perl emits one warning per pattern match attempt... because it has to interpolate $, on every pass, thus recompile the RE. The happenstance of enabling warnings and $, evaluating to undef revealed three hidden problems in the RE.

The RE itself should work in JavaScript so long as it's doing greedy capturing. Obviously the <a> formatting will require different code.

Off topic: Here's where I was hoping to go with all this... my files have various ID tags for amg, discogs, mb, etc.. I was hoping to get time to see if I could extend Advanced Tag View to construct appropriate clickable URLs based on those IDs, presented in a list if the tag had multiple IDs (e.g. semicolon-separated). My taggers create these and push them to media software I use for library management (and it supports doing this via its expression language and UI). But it would be much more efficient during listening via Lyrion to jump into track details and click the link constructed from the ID(s).

Please sign in to comment.