
Multiple entries: Remove extraneous URL elements with Regular Expressions #1765

Open
mn7216 opened this issue May 30, 2024 · 4 comments
Labels
complexity: unknown (unknown days of work), external solution possible (external solutions such as bookmarklets or user scripts are possible for implementing the feature), feature request, priority: low (issues/tasks that are not so important), weblink (domain object: URL, weblink, hyperlink)

Comments

@mn7216 (Contributor) commented May 30, 2024

Is your feature request related to a problem? Please describe. (required)

Yes. Most URLs that contain tracking parameters or other extraneous elements do not work with artist auto-adding.

Describe the solution you'd like. (required)

Use regular expressions to remove extraneous URL elements.

Example:

sp.nicovideo.jp (mobile NicoNico; does not work with auto-add or link recognition)

Example URL: https://sp.nicovideo.jp/watch/sm2154380
Include Pattern: ^https://sp\.nicovideo\.jp/(.*)$
Output: https://nicovideo.jp/$1

Niconico tracking elements can probably be removed by stripping any non-numeric characters after user/ or video/, but I don't remember the exact structure off the top of my head.
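A minimal sketch of how such an include-pattern/output pair could be applied, assuming a hard-coded rule table (the names below are illustrative only, not existing VocaDB code):

```ts
// Hypothetical pattern table: include pattern -> output template.
// Only the sp.nicovideo.jp example from above is listed.
const rewriteRules: { pattern: RegExp; output: string }[] = [
  { pattern: /^https:\/\/sp\.nicovideo\.jp\/(.*)$/, output: "https://nicovideo.jp/$1" },
];

function normalizeUrl(url: string): string {
  for (const rule of rewriteRules) {
    if (rule.pattern.test(url)) {
      // Apply the output template; "$1" is the captured path.
      return url.replace(rule.pattern, rule.output);
    }
  }
  return url; // unknown domain: leave untouched
}

// Example from this issue:
console.log(normalizeUrl("https://sp.nicovideo.jp/watch/sm2154380"));
// -> https://nicovideo.jp/watch/sm2154380
```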

Checklist (required)

Fill out the checklist.

@saturclay (Contributor)

I've also thought about this, mostly with the goal of reducing duplicate artist entries. My best guess for the most practical way to implement the changes below would be to run the URL-standardization code 1) on the create-artist page and 2) whenever a given entry is saved. That way, a script could simply go through and hit save on every artist to update its links, and there would be no need to re-run the code on every single link in the db every time it checks for duplicates (hopefully I've phrased that in a way that makes sense).

Here is a non-exhaustive list of changes I'd like to see to fix things that don't get detected as duplicates. A lot of them are for removing query strings, but there are some other odd issues I've seen as well. A rough sketch of the normalization step follows the list.

NND:

Twitter:

YouTube:

SoundCloud:

  • Removing ?* from the end of the URL

BiliBili:

Pixiv:

I'll add more if I run into any.
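As a rough illustration of the query-string cases (SoundCloud and similar), here is a sketch using the standard URL API; the domain list is an assumption, and each domain would need to be reviewed before its query strings are dropped:

```ts
// Domains where the query string is assumed safe to drop entirely.
// This list is illustrative only; each domain should be reviewed first.
const stripQueryDomains = new Set(["soundcloud.com", "www.soundcloud.com"]);

function stripTrackingQuery(rawUrl: string): string {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return rawUrl; // not a parsable URL, leave as-is
  }
  if (stripQueryDomains.has(url.hostname)) {
    url.search = ""; // removes the "?*" part
  }
  return url.toString();
}

// e.g. stripTrackingQuery("https://soundcloud.com/artist?ref=clipboard")
// -> "https://soundcloud.com/artist"
```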

@Shiroizu (Member) commented Jun 6, 2024

@saturclay Thanks for the detailed examples.

We could host an even more detailed domain:regex mapping somewhere, even before the feature is implemented, since it would be useful for fixing the existing links based on the data dump.

Steps for creating the domain:regex mapping (step 1 is sketched after the list):

  1. Get the N most common recognized external link domains.
  2. Sort links alphabetically or by length within each domain.
  3. Observe patterns and also decide if the query part is safe to remove.
  4. Generate regex conversion patterns.
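A possible starting point for step 1, assuming the links have already been extracted from the data dump into a plain-text file with one URL per line (that file format is an assumption, and the actual dump would need its own extraction step):

```ts
import { readFileSync } from "node:fs";

// Count link hostnames and return the N most common ones.
function topDomains(path: string, n: number): [string, number][] {
  const counts = new Map<string, number>();
  for (const line of readFileSync(path, "utf8").split("\n")) {
    try {
      const host = new URL(line.trim()).hostname;
      counts.set(host, (counts.get(host) ?? 0) + 1);
    } catch {
      // skip blank or malformed lines
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}

// topDomains("links.txt", 20) -> the 20 most common link domains
```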

Links such as https://www.nicovideo.jp/user/50263010/mylist/53787559 are more complicated, as those should be replaced with two new links (one for the user, one for the mylist); a rough sketch of that split is below.
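A sketch of that split, reading the example as one user link plus one mylist link (the target URL formats are my assumption, not a confirmed rule):

```ts
// Hypothetical handling of combined NND links: split
// https://www.nicovideo.jp/user/<id>/mylist/<id> into a user link
// and a mylist link.
function splitUserMylistLink(url: string): string[] {
  const m = url.match(/^https:\/\/www\.nicovideo\.jp\/user\/(\d+)\/mylist\/(\d+)/);
  if (!m) return [url];
  return [
    `https://www.nicovideo.jp/user/${m[1]}`,
    `https://www.nicovideo.jp/mylist/${m[2]}`,
  ];
}

// splitUserMylistLink("https://www.nicovideo.jp/user/50263010/mylist/53787559")
// -> ["https://www.nicovideo.jp/user/50263010",
//     "https://www.nicovideo.jp/mylist/53787559"]
```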

@andreoda added the weblink label Jun 7, 2024
@andreoda added the priority: low and complexity: unknown labels and removed the complexity: low label Jun 7, 2024
@Shiroizu added the feature request and external solution possible labels Jun 29, 2024
@bitbybyte

Not sure if this is better as a new issue, but it's worth mentioning here as well: it would be nice to normalize Twitter/X URLs, since everything redirects to x.com now and that's what people end up copying/pasting into entries. It seems x.com was already added as an external link match (#1763). Not sure what the right answer is for standardizing existing entries (Wikipedia is still fighting over this), but it feels like it would be worth doing.

@saturclay (Contributor)

I've been working on this and was wondering: what language should I write this in? Would TypeScript be preferred?

Also, @bitbybyte, I think the one concern there would be people who changed or deactivated their accounts before the switch. So if user xyz has an inactive Twitter link, changing it to X would mean it no longer goes to the right archived page. Of course, we could just not run this on links marked as inactive, but then there's still the concern that there could be links that are inactive but aren't yet marked as inactive. I think it's a good idea for currently active links; we'd just need to exercise some caution.
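A sketch of the Twitter-to-X rewrite that skips links flagged inactive; the WebLink shape and the flag name are assumptions about the data model, not actual VocaDB types:

```ts
// Hypothetical link shape; field names are placeholders.
interface WebLink {
  url: string;
  inactive: boolean;
}

function normalizeTwitterHost(link: WebLink): WebLink {
  if (link.inactive) return link; // don't touch possibly-dead archived links
  try {
    const url = new URL(link.url);
    if (url.hostname === "twitter.com" || url.hostname === "mobile.twitter.com") {
      url.hostname = "x.com";
    }
    return { ...link, url: url.toString() };
  } catch {
    return link; // not a parsable URL, leave as-is
  }
}
```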

@andreoda changed the title from "Remove extraneous URL elements with Regular Expressions" to "Multiple entries: Remove extraneous URL elements with Regular Expressions" Dec 2, 2024