Implement more sources for text results #96

Merged
merged 13 commits into from
Dec 24, 2023

Conversation

davidovski

@davidovski davidovski commented Dec 9, 2023

This PR aims to further address the problem of being ratelimited by search result providers (e.g. Google) by having LibreY scrape results from more sources. It might resolve #95.

The following should be implemented:

  • allowing the user to select which search engine they want to receive results from, or "auto" by default
  • switching search engines based on rate limits or cooldowns when on auto, to balance requests
  • implement scrapers for the following engines:
  • set preferred search engine when doing instance fallback

I may need help implementing these scrapers, so if you would like to help contribute, please tell me which scraper you'd like to implement. Template files have been added for the mentioned engines, so that they can easily be integrated into LibreY.

@davidovski changed the title from "[HELP NEEDED] Implement more sources for text results, with balancing." to "[HELP NEEDED] Implement more sources for text results" Dec 9, 2023
@Ahwxorg Ahwxorg mentioned this pull request Dec 9, 2023
@Ahwxorg
Owner

Ahwxorg commented Dec 9, 2023

Brave Search has a free API, but it is limited to 2000 requests per month. My instance hits that number twice a day. Scraping might be an option here, as the site seems to work without JS.

@Ahwxorg
Owner

Ahwxorg commented Dec 9, 2023

Ecosia does not seem to have an API, so scraping will be required if we want to use it. Running curl on Ecosia returns "Enable JavaScript and cookies to continue", so that might be an issue.

@Ahwxorg
Owner

Ahwxorg commented Dec 10, 2023

Mojeek.com seems to be really easy to scrape: the results are a plain list of <li> elements, though sadly without specific class names. They do have an API, but it's paid-only.

@Ahwxorg
Owner

Ahwxorg commented Dec 10, 2023

Google also has a paid API: "Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.

If you need more than 10k queries per day and your Programmable Search Engine searches 10 sites or fewer, you may be interested in the Custom Search Site Restricted JSON API, which does not have a daily query limit."

@Ahwxorg
Owner

Ahwxorg commented Dec 10, 2023

Bing also allows 1000 requests per month for free:

1,000 transactions free per month for all markets

@davidovski
Author

I think avoiding APIs and using scraping where possible is our best bet. 1000 requests per month for free is very limited and I'm sure most instances will reach it within a day or two. Also scraping would avoid the instance maintainer having to register to the service, which otherwise may be a privacy issue for them. I don't think that instance maintainers would be willing to pay for search results either.

There might be a way to trick Ecosia into sending the data it would normally load via JavaScript, by copying the requests it makes when it performs a search. However, if this is too complicated, it might be worth leaving Ecosia out for now and looking for other easily scrapable services.

@Ahwxorg
Owner

Ahwxorg commented Dec 10, 2023

I think avoiding APIs and using scraping where possible is our best bet. 1000 requests per month for free is very limited and I'm sure most instances will reach it within a day or two. Also scraping would avoid the instance maintainer having to register to the service, which otherwise may be a privacy issue for them. I don't think that instance maintainers would be willing to pay for search results either.

I know and strongly agree, but issue #95 mentioned APIs, so I decided to do the extra research to show that it's practically impossible to use them whilst keeping LibreY free as in freedom and as in price.

@codedipper

Ecosia does not seem to have an API, so scraping will be required if we want to use it. Running curl on Ecosia returns "Enable JavaScript and cookies to continue", so that might be an issue.

This is a Cloudflare issue. Without Tor, on a server IP, spoofing the user agent seems to be sufficient to bypass the JavaScript block:

curl -V:

$ curl -V
curl 8.5.0 (x86_64-unknown-openbsd7.4) libcurl/8.5.0 LibreSSL/3.8.2 zlib/1.3 nghttp2/1.57.0 ngtcp2/0.19.1 nghttp3/0.15.0
Release-Date: 2023-12-06
Protocols: dict file ftp ftps gopher gophers http https imap imaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS HSTS HTTP2 HTTP3 HTTPS-proxy IPv6 Largefile libz NTLM NTLM_WB SSL threadsafe UnixSockets

Fail:

$ curl -I https://www.ecosia.org/
HTTP/2 403
...

Success (results in legible response):

$ curl -I -A "Mozilla/5.0 (Windows NT 10.0; rv:120.0) Gecko/20100101 Firefox/120.0" https://www.ecosia.org/
HTTP/2 200
...
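
For reference, the same check can be sketched in Python using only the standard library. This is an illustration rather than LibreY's code (which is PHP): `build_request` and `fetch_status` are hypothetical helper names, the User-Agent string is the one from the curl command above, and there is no guarantee that Ecosia or Cloudflare will keep accepting it.

```python
import urllib.error
import urllib.request

# User-Agent string copied from the curl example above.
FIREFOX_UA = "Mozilla/5.0 (Windows NT 10.0; rv:120.0) Gecko/20100101 Firefox/120.0"


def build_request(url: str, user_agent: str = FIREFOX_UA) -> urllib.request.Request:
    """Build a HEAD request (like `curl -I`) that sends a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method="HEAD")


def fetch_status(url: str) -> int:
    """Send the request and return the HTTP status code (requires network access)."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return error.code
```

Calling `fetch_status("https://www.ecosia.org/")` would perform the equivalent of the second curl command; the result depends on the network the request comes from.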

@davidovski
Author

Made it so that the engines are automatically selected when engine is "auto", based on cooldowns.
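
A rough sketch of what cooldown-based selection could look like, written in Python for illustration only (LibreY itself is PHP, and this is not the code from the PR). The function names, the cooldown dictionary, the 600-second default and the random choice among available engines are all assumptions; only the engine names and the "auto" setting come from this thread.

```python
import random
import time

# Engine names mentioned in this conversation.
ENGINES = ["google", "duckduckgo", "brave", "yandex", "ecosia", "mojeek"]


def mark_ratelimited(cooldowns, engine, seconds=600, now=None):
    """Record that `engine` should not be used until `seconds` from now.

    The 600-second default is an arbitrary placeholder, not a value from the PR.
    """
    now = time.time() if now is None else now
    cooldowns[engine] = now + seconds


def pick_engine(requested, cooldowns, now=None, rng=random):
    """Return the engine to query.

    requested: an engine name, or "auto" to let the instance decide.
    cooldowns: dict mapping engine name -> UNIX timestamp when its cooldown ends.
    """
    if requested != "auto":
        # An explicit user preference always wins, even if it risks a ratelimit.
        return requested
    now = time.time() if now is None else now
    available = [e for e in ENGINES if cooldowns.get(e, 0) <= now]
    if available:
        # Spread requests across every engine that is not cooling down.
        return rng.choice(available)
    # Everything is cooling down: use the engine whose cooldown ends soonest.
    return min(ENGINES, key=lambda e: cooldowns.get(e, 0))
```

Picking randomly among the available engines is one simple way to balance requests; a round-robin or least-recently-used policy would fit the same interface.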

Squashed a different bug in the last commit: caching was not working properly during fallback, since the URL of the fallback instance was being used. I haven't tested it with fallback, but it should work when switching engines.

@davidovski
Author

Implemented a setting for selecting which engine to use, so that users can still choose their preferred engine. Restricting results to one particular engine may cause these users to hit more ratelimits, but it ensures that they keep the freedom to use whichever results source they want, for example if they don't get the results they are looking for from other engines.

This preferred engine is passed to fallback instances via the engine parameter, meaning that fallback instances that include this commit should be able to return results from the engine the user has selected.
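
As an illustration of forwarding the preference, here is a small Python sketch (not LibreY's PHP code). Only the `engine` parameter name comes from the comment above; the `search.php` path, the `q` parameter, the helper name `fallback_url` and the host `librey.example` are assumptions made for the example.

```python
from urllib.parse import urlencode


def fallback_url(instance, query, engine="auto"):
    """Build a search URL for a fallback instance, forwarding the engine choice.

    `instance` is the base URL of the fallback instance. The path and the
    `q` parameter name are assumed, not taken from the PR.
    """
    params = {"q": query, "engine": engine}
    return instance.rstrip("/") + "/search.php?" + urlencode(params)


# Hypothetical fallback instance; the user's selected engine travels with the query.
url = fallback_url("https://librey.example/", "gentoo linux", engine="mojeek")
```

An instance that does not know the `engine` parameter would presumably just ignore it, which is why only instances that include this commit can honour the preference.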

@davidovski
Author

I wasn't able to figure out how to pass a parameter similar to Google's npfr to DuckDuckGo, Brave and Yandex. On Google, npfr enables the "did you mean?" feature (rather than "showing results for"). Instead, "spellcheck" and "noreask" were used to disable spell correction of results. This avoids confusing cases where LibreY shows results for the corrected query without being able to show the correction itself (e.g. gentou linux would show results for gentoo linux without the "showing results for, did you mean?" notice).

If someone is able to figure out a way to get a feature like Google's npfr for the other engines, that is, to still show results for the misspelled query while offering a suggestion, then that would be great.

@davidovski
Author

Implemented the Mojeek and Ecosia search scrapers, and they seem to be working well. I haven't been able to figure out how to set the language for Mojeek results, or how to stop it from autocorrecting searches.

Other than these two small features, I think that this PR is ready. Simply varying results between the 4 previous engines was enough to prevent my instance from being ratelimited by all of them at the same time.

Selecting a particular results source seems to work pretty well, allowing users to choose an engine if they aren't satisfied with results from another.

If you think these changes are fine as they are, then feel free to merge, though maybe an issue for adding the missing features should be opened.

@Ahwxorg
Owner

Ahwxorg commented Dec 24, 2023

Looks very great! Might also be nice to add a feature along the lines of "Use all engines except xyz".

@Ahwxorg Ahwxorg marked this pull request as ready for review December 24, 2023 15:47
Owner

@Ahwxorg Ahwxorg left a comment

nice

engines/text/brave.php
engines/text/ecosia.php
engines/text/mojeek.php
engines/text/text.php
engines/text/yandex.php
locale/en.php
misc/search_engine.php
@Ahwxorg changed the title from "[HELP NEEDED] Implement more sources for text results" to "Implement more sources for text results" Dec 24, 2023
@Ahwxorg Ahwxorg merged commit ccc16be into Ahwxorg:main Dec 24, 2023
@Ahwxorg
Owner

Ahwxorg commented Dec 30, 2023

image

markzg from RiseUp emailed me again, @davidovski, see above.

Development

Successfully merging this pull request may close these issues.

No results anymore
4 participants