
Multiple entries: Remove extraneous URL elements with Regular Expressions #1765

Open
mn7216 opened this issue May 30, 2024 · 4 comments
Labels
complexity: unknown (unknown days of work), external solution possible (external solutions such as bookmarklets or user scripts are possible for implementing the feature), feature request, priority: low (issues/tasks that are not so important), weblink (domain object: URL, weblink, hyperlink)

Comments

@mn7216 (Contributor) commented May 30, 2024

Is your feature request related to a problem? Please describe. (required)

Yes. Most URLs that contain tracking parameters or other extraneous elements do not work with artist auto-adding.

Describe the solution you'd like. (required)

Use regular expressions to remove extraneous URL elements.

Example:

sp.nicovideo.jp (mobile NicoNico; does not work with auto-add or link recognition)

Example URL: https://sp.nicovideo.jp/watch/sm2154380
Include Pattern: ^https://sp\.nicovideo\.jp/(.*)$
Output: https://nicovideo.jp/$1

Niconico tracking elements can probably be removed by stripping any non-numeric characters after user/ or video/, but I don't remember the exact structure off the top of my head.
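A minimal sketch of how such an include-pattern/output pair could be applied, assuming a hard-coded rule table (the names below are illustrative only, not existing VocaDB code):

```ts
// Hypothetical pattern table: include pattern -> output template.
// Only the sp.nicovideo.jp example from above is listed.
const rewriteRules: { pattern: RegExp; output: string }[] = [
  { pattern: /^https:\/\/sp\.nicovideo\.jp\/(.*)$/, output: "https://nicovideo.jp/$1" },
];

function normalizeUrl(url: string): string {
  for (const rule of rewriteRules) {
    if (rule.pattern.test(url)) {
      // Apply the output template; "$1" is the captured path.
      return url.replace(rule.pattern, rule.output);
    }
  }
  return url; // unknown domain: leave untouched
}

// Example from this issue:
console.log(normalizeUrl("https://sp.nicovideo.jp/watch/sm2154380"));
// -> https://nicovideo.jp/watch/sm2154380
```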

Checklist (required)

Fill out the checklist.

@saturclay (Contributor)

I've also thought about this, mostly with the goal of reducing duplicate artist entries. My best guess for the most practical way to implement the changes below would be to run the URL-standardization code 1) on the create-artist page and 2) whenever a given entry is saved. That way, a script could simply go through and hit save on every artist to update its links, and there would be no need to re-run the code on every single link in the db every time it checks for duplicates (hopefully I've phrased that in a way that makes sense).

Here is a non-exhaustive list of changes I'd like to see to fix things that don't get detected as duplicates. A lot of them are for removing query strings, but there are some other odd issues I've seen as well. A rough sketch of the normalization step follows the list.

NND:

Twitter:

YouTube:

SoundCloud:

  • Removing ?* from the end of the URL

BiliBili:

Pixiv:

I'll add more if I run into any.
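As a rough illustration of the query-string cases (SoundCloud and similar), here is a sketch using the standard URL API; the domain list is an assumption, and each domain would need to be reviewed before its query strings are dropped:

```ts
// Domains where the query string is assumed safe to drop entirely.
// This list is illustrative only; each domain should be reviewed first.
const stripQueryDomains = new Set(["soundcloud.com", "www.soundcloud.com"]);

function stripTrackingQuery(rawUrl: string): string {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return rawUrl; // not a parsable URL, leave as-is
  }
  if (stripQueryDomains.has(url.hostname)) {
    url.search = ""; // removes the "?*" part
  }
  return url.toString();
}

// e.g. stripTrackingQuery("https://soundcloud.com/artist?ref=clipboard")
// -> "https://soundcloud.com/artist"
```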

@Shiroizu (Member) commented Jun 6, 2024

@saturclay Thanks for the detailed examples.

We could host an even more detailed domain:regex mapping somewhere, even before the feature is implemented, since it would be useful for fixing the existing links based on the data dump.

Steps for creating the domain:regex mapping (step 1 is sketched after the list):

  1. Get the N most common recognized external link domains.
  2. Sort links alphabetically or by length within each domain.
  3. Observe patterns and also decide if the query part is safe to remove.
  4. Generate regex conversion patterns.
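A possible starting point for step 1, assuming the links have already been extracted from the data dump into a plain-text file with one URL per line (that file format is an assumption, and the actual dump would need its own extraction step):

```ts
import { readFileSync } from "node:fs";

// Count link hostnames and return the N most common ones.
function topDomains(path: string, n: number): [string, number][] {
  const counts = new Map<string, number>();
  for (const line of readFileSync(path, "utf8").split("\n")) {
    try {
      const host = new URL(line.trim()).hostname;
      counts.set(host, (counts.get(host) ?? 0) + 1);
    } catch {
      // skip blank or malformed lines
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}

// topDomains("links.txt", 20) -> the 20 most common link domains
```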

Links such as https://www.nicovideo.jp/user/50263010/mylist/53787559 are more complicated, as those should be replaced with two new links (one for the user, one for the mylist); a rough sketch of that split is below.
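A sketch of that split, reading the example as one user link plus one mylist link (the target URL formats are my assumption, not a confirmed rule):

```ts
// Hypothetical handling of combined NND links: split
// https://www.nicovideo.jp/user/<id>/mylist/<id> into a user link
// and a mylist link.
function splitUserMylistLink(url: string): string[] {
  const m = url.match(/^https:\/\/www\.nicovideo\.jp\/user\/(\d+)\/mylist\/(\d+)/);
  if (!m) return [url];
  return [
    `https://www.nicovideo.jp/user/${m[1]}`,
    `https://www.nicovideo.jp/mylist/${m[2]}`,
  ];
}

// splitUserMylistLink("https://www.nicovideo.jp/user/50263010/mylist/53787559")
// -> ["https://www.nicovideo.jp/user/50263010",
//     "https://www.nicovideo.jp/mylist/53787559"]
```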

@andreoda added the weblink label Jun 7, 2024
@andreoda added the priority: low and complexity: unknown labels and removed the complexity: low label Jun 7, 2024
@Shiroizu added the feature request and external solution possible labels Jun 29, 2024
@bitbybyte

Not sure if this is better as a new issue, but it's worth mentioning here as well: it would be nice to normalize Twitter/X URLs, since everything redirects to x.com now and that's what people end up copying/pasting into entries. It seems x.com was already added as an external link match (#1763). Not sure what the right answer is for standardizing existing entries (Wikipedia is still fighting over this), but it feels like it would be worth doing.

@saturclay (Contributor)

I've been working on this and was wondering: what language should I write this in? Would TypeScript be preferred?

Also, @bitbybyte, I think the one concern there would be people who changed or deactivated their accounts before the switch. So if user xyz has an inactive Twitter link, changing it to X would mean it no longer goes to the right archived page. Of course, we could just not run this on links marked as inactive, but then there's still the concern that there could be links that are inactive but aren't yet marked as inactive. I think it's a good idea for currently active links; we'd just need to exercise some caution.
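A sketch of the Twitter-to-X rewrite that skips links flagged inactive; the WebLink shape and the flag name are assumptions about the data model, not actual VocaDB types:

```ts
// Hypothetical link shape; field names are placeholders.
interface WebLink {
  url: string;
  inactive: boolean;
}

function normalizeTwitterHost(link: WebLink): WebLink {
  if (link.inactive) return link; // don't touch possibly-dead archived links
  try {
    const url = new URL(link.url);
    if (url.hostname === "twitter.com" || url.hostname === "mobile.twitter.com") {
      url.hostname = "x.com";
    }
    return { ...link, url: url.toString() };
  } catch {
    return link; // not a parsable URL, leave as-is
  }
}
```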

@andreoda changed the title from "Remove extraneous URL elements with Regular Expressions" to "Multiple entries: Remove extraneous URL elements with Regular Expressions" Dec 2, 2024