Implement more sources for text results #96

davidovski · 2023-12-09T00:39:53Z

This PR aims to further reduce the problem of being ratelimited by search result providers (i.e. Google) by scraping results from more sources on LibreY. Might resolve #95.

The following should be implemented:

allowing the user to select which search engine they want to receive results from, or "auto" by default
switching search engines based on ratelimits or cooldowns when on auto, to balance requests
implement scrapers for the following engines:
set preferred search engine when doing instance fallback

I may need help implementing these scrapers, so if you would like to help contribute, please tell me which scraper you'd like to implement. Template files have been added for the mentioned engines, so that they can easily be integrated into LibreY.

Ahwxorg · 2023-12-09T23:48:57Z

Brave Search has a free API, but it's limited to 2000 requests per month. That's something I hit twice daily. Scraping might be an option here, as it seems to work without JS.

Ahwxorg · 2023-12-09T23:55:53Z

Ecosia does not seem to have an API, so scraping will be required if we want to use Ecosia. Running curl on Ecosia results in "Enable JavaScript and cookies to continue", so that might be an issue.

Ahwxorg · 2023-12-10T00:03:21Z

Mojeek.com seems to be really easy to scrape: the results are a series of <li> elements, though sadly without specific class names. They do have an API, but it's paid-only.
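To illustrate what scraping a list of plain <li> elements without class names could look like, here is a minimal Python sketch using only the standard library. It is not LibreY's PHP scraper, and the sample HTML is invented for the example rather than copied from Mojeek; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ListItemLinkParser(HTMLParser):
    """Collect (url, title) from the first link inside each <li>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_li = False
        self._li_done = False   # True once this <li> has yielded a link
        self._href = None       # href of the link currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._li_done = False
        elif tag == "a" and self._in_li and not self._li_done:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside the link being captured counts as the title.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append((self._href, "".join(self._text).strip()))
            self._href = None
            self._li_done = True
        elif tag == "li":
            self._in_li = False


def extract_links(html):
    parser = ListItemLinkParser()
    parser.feed(html)
    return parser.results


# Invented markup for illustration only (an assumption, not Mojeek's real page).
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.org/a">First <b>result</b></a><p>Snippet A</p></li>
  <li><a href="https://example.org/b">Second result</a><a href="https://example.org/cached">cached</a></li>
  <li>No link here</li>
</ul>
"""

links = extract_links(SAMPLE_HTML)
```

Without class names to match on, a scraper like this has to rely on document structure (here, "first link in each list item"), which is more fragile if the page layout changes.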

Ahwxorg · 2023-12-10T00:06:29Z

Google also has a paid API: "Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day.

If you need more than 10k queries per day and your Programmable Search Engine searches 10 sites or fewer, you may be interested in the Custom Search Site Restricted JSON API, which does not have a daily query limit."

Ahwxorg · 2023-12-10T00:17:01Z

Bing also allows 1000 requests for free.

1,000 transactions free per month for all markets

davidovski · 2023-12-10T13:33:52Z

I think avoiding APIs and using scraping where possible is our best bet. 1000 requests per month for free is very limited and I'm sure most instances will reach it within a day or two. Also, scraping would avoid the instance maintainer having to register with the service, which otherwise may be a privacy issue for them. I don't think that instance maintainers would be willing to pay for search results either.

There might be a way to trick Ecosia into sending the data it would otherwise load via JavaScript, by copying the requests it makes when it does a search. However, if this is too complicated, it might be worth leaving it out for now and looking for other easily scrapable services.

Ahwxorg · 2023-12-10T14:18:16Z

> I think avoiding APIs and using scraping where possible is our best bet. 1000 requests per month for free is very limited and I'm sure most instances will reach it within a day or two. Also, scraping would avoid the instance maintainer having to register with the service, which otherwise may be a privacy issue for them. I don't think that instance maintainers would be willing to pay for search results either.

I know and strongly agree, but issue #95 mentioned APIs, so I decided to do the extra research to show it's practically impossible to use them whilst keeping LibreY free as in freedom and as in price.

codedipper · 2023-12-10T19:20:13Z

> Ecosia does not seem to have an API, so scraping will be required if we want to use Ecosia. Running curl on Ecosia results in "Enable JavaScript and cookies to continue", so that might be an issue.

This is a Cloudflare issue. Without Tor, on a server IP, spoofing the user agent seems to be sufficient to bypass the JavaScript block:

curl -V:

$ curl -V
curl 8.5.0 (x86_64-unknown-openbsd7.4) libcurl/8.5.0 LibreSSL/3.8.2 zlib/1.3 nghttp2/1.57.0 ngtcp2/0.19.1 nghttp3/0.15.0
Release-Date: 2023-12-06
Protocols: dict file ftp ftps gopher gophers http https imap imaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS HSTS HTTP2 HTTP3 HTTPS-proxy IPv6 Largefile libz NTLM NTLM_WB SSL threadsafe UnixSockets

Fail:

$ curl -I https://www.ecosia.org/
HTTP/2 403
...

Success (results in legible response):

$ curl -I -A "Mozilla/5.0 (Windows NT 10.0; rv:120.0) Gecko/20100101 Firefox/120.0" https://www.ecosia.org/
HTTP/2 200
...
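The same check can be sketched in Python with the standard library. This is only an illustration of sending a HEAD request with a browser-like User-Agent header; the helper names are hypothetical, and whether a given site accepts the request is not guaranteed, since it may look at more than the user agent.

```python
import urllib.error
import urllib.request

# Same user agent string as in the curl example above.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; rv:120.0) Gecko/20100101 Firefox/120.0"


def build_head_request(url, user_agent=BROWSER_UA):
    """Build a HEAD request (like `curl -I`) with a custom User-Agent (like `-A`)."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )


def head_status(url, user_agent=BROWSER_UA, timeout=10.0):
    """Return the HTTP status code, including error codes such as 403."""
    request = build_head_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the code is still the interesting part here.
        return error.code
```

Calling `head_status("https://www.ecosia.org/")` performs a real network request, so the result depends on the network and on the site's current behaviour.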

davidovski · 2023-12-12T17:05:26Z

Made it so that the engines are automatically selected when engine is "auto", based on cooldowns.
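The thread does not show the selection logic itself, so the following Python sketch is only one possible reading of "automatically selected based on cooldowns": pick at random among engines that are not on cooldown, and put an engine on cooldown when it reports a ratelimit. The class name, the one-hour default and the random choice are all assumptions for illustration, not LibreY's PHP implementation.

```python
import random
import time


class EngineBalancer:
    """Pick a search engine for "auto", skipping engines that are on cooldown."""

    def __init__(self, engines, cooldown_seconds=3600, clock=time.time, rng=None):
        if not engines:
            raise ValueError("at least one engine is required")
        self.engines = list(engines)
        self.cooldown_seconds = cooldown_seconds  # assumed value, not from the PR
        self.clock = clock
        self.rng = rng or random.Random()
        self.cooldown_until = {}  # engine name -> time it becomes usable again

    def available(self):
        now = self.clock()
        return [e for e in self.engines if self.cooldown_until.get(e, 0) <= now]

    def choose(self, preferred="auto"):
        # Assumption: an explicit user choice is honoured even during a cooldown.
        if preferred != "auto":
            return preferred
        candidates = self.available()
        if not candidates:
            # Everything is cooling down: take the engine that frees up first.
            return min(self.engines, key=lambda e: self.cooldown_until.get(e, 0))
        return self.rng.choice(candidates)

    def report_ratelimited(self, engine):
        self.cooldown_until[engine] = self.clock() + self.cooldown_seconds
```

A caller would ask `choose()` for an engine, run the scraper, and call `report_ratelimited(engine)` if the provider blocks the request, so that later "auto" requests are spread across the remaining engines.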

Squashed a different bug in the last commit: the caching feature was not working properly when doing fallback, since the URL of the fallback instance was being used. I haven't tested it with fallback, but it should work when switching engines.
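The actual fix is not shown in the thread. As a hedged illustration of how this class of bug can be avoided, the Python sketch below derives the cache key from the logical request (query, page, engine preference) rather than from the URL of whichever instance served it. The function name and the set of fields are assumptions.

```python
import hashlib
import json


def cache_key(query, page=0, engine="auto", result_type="text"):
    """Build a cache key that does not depend on which instance answered."""
    payload = json.dumps(
        {"q": query.strip(), "p": page, "engine": engine, "t": result_type},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# A plain dict stands in for the real cache in this sketch.
cache = {}

# Results obtained through a fallback instance are stored under the same key
# that a direct request for the same search would look up.
cache[cache_key("gentoo linux")] = ["result fetched via a fallback instance"]
cached = cache.get(cache_key("gentoo linux"))
```

Keying on the request instead of the URL means a later search for the same query hits the cache regardless of whether the first answer came from the local scraper or from a fallback instance.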

davidovski · 2023-12-18T01:39:04Z

Implemented a setting for selecting which engine to use, so that users can still choose their preferred engine. While this may cause these users to experience more ratelimits by limiting their search results to a particular engine, it ensures that they still have the freedom to use whichever results source they want, for example if they don't get the particular results they are looking for with other engines.

This preferred engine is passed to fallback instances via the engine parameter, meaning that fallback instances that implement this commit should be able to return results from the engine that the user has selected.
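As a rough Python sketch of forwarding the preference, the helper below appends an `engine` query parameter when building the request to a fallback instance. Only the `engine` parameter name comes from this thread; the `/api.php` path, the `q` and `p` parameters, the example hostname, and the choice to omit `engine` when it is "auto" are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def fallback_url(instance, query, page=0, engine="auto"):
    """Build a request URL for a fallback instance, forwarding the engine choice."""
    params = {"q": query, "p": page}
    if engine != "auto":
        # Forward only an explicit preference; "auto" lets the instance decide.
        params["engine"] = engine
    return instance.rstrip("/") + "/api.php?" + urlencode(params)


url = fallback_url("https://librey.example/", "gentoo linux", engine="brave")
forwarded = parse_qs(urlsplit(url).query)
```

An instance that predates the setting would simply ignore the unknown parameter, which matches the note above that only fallback instances implementing this commit can honour the selected engine.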

davidovski · 2023-12-18T01:43:32Z

I wasn't able to figure out how to pass a param similar to npfr to DuckDuckGo, Brave and Yandex. On Google, it enables the "did you mean?" feature (rather than "showing results for"). Instead, "spellcheck" and "noreask" were used to disable spell correction of results. This avoids confusing cases where LibreY shows results for a corrected query without being able to show the corrected text (i.e. gentou linux would show results for gentoo linux without showing the "showing results for, did you mean?" notice).

If someone is able to figure out a way to get a feature like Google's npfr for the other engines, that is, to still show results for the misspelled query while offering a suggestion, then that would be great.

davidovski · 2023-12-23T23:36:59Z

Implemented Mojeek and Ecosia search scrapers, and they seem to be working well. I haven't been able to figure out how to set the language for Mojeek results, or how to stop it from autocorrecting searches.

Other than these two small features, I think that this PR is ready. Simply varying results between the 4 previous engines was enough to prevent my instance from being ratelimited by all of them at the same time.

Selecting a particular results source seems to work pretty well, allowing users to choose an engine if they aren't satisfied with results from another.

If you think these changes are fine as they are, then feel free to merge, though maybe an issue should be opened for adding the missing features.

Ahwxorg · 2023-12-24T00:52:35Z

Looks great! It might also be nice to add a feature along the lines of "Use all engines except xyz".

Ahwxorg

nice

engines/text/brave.php

engines/text/ecosia.php

engines/text/mojeek.php

engines/text/text.php

engines/text/yandex.php

locale/en.php

misc/search_engine.php

Ahwxorg · 2023-12-30T00:26:13Z

markzg from RiseUp emailed me again, @davidovski, see above.

davidovski changed the title ~~[HELP NEEDED] Implement more sources for text results, with balancing.~~ [HELP NEEDED] Implement more sources for text results Dec 9, 2023

Ahwxorg mentioned this pull request Dec 9, 2023

No results anymore #95

Open

davidovski added 13 commits December 18, 2023 01:44

add stubs for new text results sources

f3e8b2b

Implement brave search scraper

7c5789d

Add lazy balancing between engines with "auto"

e335ee5

balance results between engines on auto

fb76571

Don't use an engine if its on cooldown

12e635a

fix typo

a21c071

Add yandex scraping

cf4e4ee

ensure that results doesnt stop printing on an invalid result

c756acd

Remove search spelling suggestion

5fa0552

Add prefered engine setting to settings page

899864f

apply engine setting when doing fallback

19a591a

add ecosia search

8decb48

Implement mojeek

28d1307

Ahwxorg marked this pull request as ready for review December 24, 2023 15:47

Ahwxorg approved these changes Dec 24, 2023

Ahwxorg changed the title ~~[HELP NEEDED] Implement more sources for text results~~ Implement more sources for text results Dec 24, 2023

Ahwxorg merged commit ccc16be into Ahwxorg:main Dec 24, 2023

davidovski mentioned this pull request Dec 30, 2023

Fix fetching other pages when first page is cached #105

Merged
