Implement more sources for text results #96
Brave Search has a free API, but it's limited to 2,000 requests per month. That's something I hit daily, twice. Scraping might be an option here, as it seems to work without JS.
Ecosia does not seem to have an API, so scraping will be required if we want to use Ecosia. Running
Mojeek.com seems to be really easy to scrape: the results are a bunch of <li> elements, though sadly without specific class names. They do have an API, but it's paid-only.
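A minimal sketch of what scraping class-less <li> results could look like, using only Python's standard library. The markup in the usage note is made up for illustration and is not Mojeek's actual HTML; `ListItemLinkParser` and `parse_list_results` are hypothetical names, not part of LibreY.

```python
from html.parser import HTMLParser


class ListItemLinkParser(HTMLParser):
    """Collect the first link and the visible text of every top-level <li>."""

    def __init__(self):
        super().__init__()
        self.results = []      # finished {"url", "text"} entries
        self._depth = 0        # nesting depth of <li> elements
        self._current = None   # entry being built for the outermost <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._current = {"url": None, "text": []}
        elif tag == "a" and self._current is not None:
            # Only the first link inside the item is treated as the result URL.
            if self._current["url"] is None:
                self._current["url"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current["text"].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Items without any link are assumed not to be search results.
                if self._current["url"]:
                    self.results.append({
                        "url": self._current["url"],
                        "text": " ".join(self._current["text"]),
                    })
                self._current = None


def parse_list_results(html):
    """Return a list of {"url", "text"} dicts, one per linked <li>."""
    parser = ListItemLinkParser()
    parser.feed(html)
    return parser.results
```

For example, `parse_list_results('<ul><li><a href="https://example.org">Example</a><p>Some snippet</p></li><li>no link</li></ul>')` yields a single entry for example.org. Without class names, a real scraper would probably also need to filter out navigation lists, which this sketch does not attempt.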
Google also has a paid API: "Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day. If you need more than 10k queries per day and your Programmable Search Engine searches 10 sites or fewer, you may be interested in the Custom Search Site Restricted JSON API, which does not have a daily query limit."
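To put the quoted pricing in perspective, here is a rough calculation of the daily bill. It assumes the 100 free queries are deducted before billing starts and that billing is linear rather than in blocks of 1000; `daily_cost_usd` is a hypothetical helper, not part of any Google client library.

```python
def daily_cost_usd(queries_per_day, free_per_day=100, usd_per_1000=5.0, daily_cap=10_000):
    """Estimate the daily cost of the Custom Search JSON API under the quoted terms."""
    if queries_per_day > daily_cap:
        raise ValueError("the quoted plan allows at most 10k queries per day")
    # Only queries beyond the free allowance are billed (assumption).
    billable = max(0, queries_per_day - free_per_day)
    return billable * usd_per_1000 / 1000
```

Under these assumptions, an instance serving 4,000 queries a day (the volume that exhausts Brave's monthly quota twice daily) would cost `daily_cost_usd(4000)`, i.e. 19.50 USD per day.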
Bing also allows 1000 requests for free.
I think avoiding APIs and using scraping where possible is our best bet. 1000 requests per month for free is very limited, and I'm sure most instances will reach it within a day or two. Scraping would also spare the instance maintainer from having to register with the service, which may otherwise be a privacy issue for them. I don't think instance maintainers would be willing to pay for search results either. There might be a way to trick Ecosia into sending the data it would normally load via JavaScript by copying the requests it makes when it does a search. However, if this is too complicated, it might be worth leaving it out for now and looking for other easily scrapable services.
I know and strongly agree, but issue #95 mentioned APIs, so I decided to do the extra research to show it's practically impossible to use them whilst keeping LibreY free as in freedom and as in price.
This is a Cloudflare issue. Without Tor, on a server IP, spoofing the user agent seems to be sufficient for bypassing JavaScript blocks:
Fail:
Success (results in legible response):
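For reference, sending a browser-like user agent could be sketched as follows with Python's standard library. This is not the command used in the comment above; the base URL, the `q` parameter name, and the user agent string are all placeholders chosen for illustration.

```python
import urllib.parse
import urllib.request

# Placeholder browser-like user agent; any current browser string could be used.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_search_request(base_url, query, user_agent=BROWSER_USER_AGENT):
    """Build a GET request carrying a spoofed User-Agent header."""
    # "q" as the query parameter name is an assumption, not taken from any engine.
    url = base_url + "?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would look like `fetch(build_search_request("https://search.example/search", "librey"))`, where search.example stands in for the real engine. Whether the header alone is enough depends on the engine's bot protection at the time.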
Made it so that the engine is automatically selected based on cooldowns when engine is set to "auto". Squashed a different bug in the last commit: the caching feature was not working properly when doing fallback, since the URL for the fallback instance was being used. Haven't tested with fallback, but it should work when switching engines.
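The "auto" selection described above could be sketched roughly like this. It is an illustration only, not the code from this PR: the function names, the first-available ordering, and the shape of the `cooldowns` mapping (engine name to the Unix time at which it may be used again) are all assumptions.

```python
import time


def mark_rate_limited(cooldowns, engine, seconds, now=None):
    """Record that an engine should not be used for the next `seconds` seconds."""
    now = time.time() if now is None else now
    cooldowns[engine] = now + seconds


def select_engine(engines, cooldowns, now=None):
    """Pick the first engine that is not cooling down.

    If every engine is cooling down, fall back to the one whose cooldown
    expires soonest, so a request is still attempted.
    """
    if not engines:
        raise ValueError("no engines configured")
    now = time.time() if now is None else now
    for engine in engines:
        if cooldowns.get(engine, 0) <= now:
            return engine
    return min(engines, key=lambda engine: cooldowns[engine])
```

With `engines = ["google", "duckduckgo", "brave"]`, calling `mark_rate_limited(cooldowns, "google", 60)` makes `select_engine` skip Google for a minute and return the next engine in the list instead.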
Implemented a setting for selecting which engine to use, so that users can still choose their preferred engine. While this may cause these users to hit more rate limits, since their search results are restricted to a particular engine, it ensures that they keep the freedom to use whichever results source they want, for example if they don't get the results they are looking for with other engines. This preferred engine parameter is passed to fallback instances with param
I wasn't able to figure out how to pass a param similar to npfr to DuckDuckGo, Brave and Yandex, which on Google enables the "did you mean?" feature (rather than "showing results for"). Instead, "spellcheck" and "noreask" were used to disable spell correction of results, to avoid confusing queries where LibreY is unable to show the corrected text while still showing results for it (ie If someone is able to figure out a way to get a feature like Google's
Implemented Mojeek and Ecosia search scrapers, and they seem to be working well. I haven't been able to figure out how to set the language for Mojeek results, nor how to stop it from autocorrecting searches. Other than these two small features, I think that this PR is ready. Simply varying results across the 4 previous engines was enough to prevent my instance from being ratelimited by all of them at the same time. Selecting a particular results source seems to work pretty well, allowing users to choose an engine if they aren't satisfied with results from another. If you think these changes are fine as they are, then feel free to merge, though maybe an issue for adding the missing features should be opened.
Looks great! It might also be nice to add a feature along the lines of "Use all engines except xyz".
nice
markzg from RiseUp emailed me again, @davidovski, see above.
This PR aims to further address the issue of being ratelimited by search result providers (e.g. Google) by scraping results from more sources in LibreY. It might resolve #95.
The following should be implemented:
I may need help implementing these scrapers, so if you would like to help contribute, please tell me which scraper you'd like to implement. Template files have been added for the mentioned engines, so that they can easily be integrated into LibreY.