-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for non-HTTP CONNECT proxies #140
Comments
Hello and thank you for your interest in this project!
This Agent is only used with http proxy / http target. Something like this will work: gotScraping
.get({
url: 'http://jindrich.bar/', // https:// will fail
proxyUrl: 'http://127.0.0.1:3000/',
}) This is because without the CONNECT support on the proxy side, I see you've already commented on an issue in the loopback repo - you all have good points there, if I can add my two cents - I don't think it makes much sense to run a caching proxy locally, exactly because of TLS. Perhaps you can mock the network requests in your tests with nock... or something similar? |
Good to know! Thanks.
I can see how those could definitely be issues in a lot of situations but unless I'm misunderstanding things I think local caching still remains a strong use case and not one where TLS errors are likely or MITM attacks are much of a concern. Requesting HTTPS resources via a target based proxy means that as far as got is concerned it's making a HTTP request and as far the end server is concerned it's receiving a HTTPS connection. I don't think there should be anything to tip the end server off to this being anything other than a standard HTTPS connection (again if I'm understanding things correctly?). If you're accessing sites that are heavily client-side driven then there could well be issues with client javascript freaking out because a resource it's expecting to be accessed via HTTPS is instead being accessed via HTTP. But if that's the case you're probably more likely to be using playwright or puppeteer rather than got. So I don't think someone using got-scraping for this is likely to run into any TLS issues (but once again I could be misunderstanding something here). MITM attacks are of course a serious concert in general. However I'm not sure that's too much of a risk in this situation. The only major reason I can think of when using got-scrapping for wanting to access HTTPS sites over a HTTP target based proxy rather than the much more standard (for HTTPS) CONNECT method, is so the content can be transparently cached. And if you're setting up a caching proxy as part of your app (with something like @loopback/http-caching-proxy) then the first HTTP hop from got to the caching proxy never leaves your machine and so provides little greater opportunity for a MITM attack (unless an attacker has access to your machine but you've probably got bigger problems in that case). Of cause you do need to trust your proxy, which is likely written by a third-party, but that's not much more risk than trusting got-scrapping or any of the deps that got-scrapping uses. I could certainly see that supporting HTTPS requests via a HTTP target based proxy would provide an extra way for someone to shoot themselves in the foot if they were to use it in a silly way like using an unencrypted proxy over the internet. But since if this feature were to be implemented I imagine it would be something you'd explicitly have to opt into I think including a warning about that in the same doco that describes how to enable it should be sufficient to provide reasonable mitigation. To provide a stronger case as to why I think such a feature is helpful I'll flesh out the use-case I'm envisioning. got-scraping is obviously intended for scrapping websites and of cause web scrapping code is often pretty brittle as the website can be updated at any time with no warning. So it's possible that you might be regularly needing to update your scrapping code. When you're updating your scrapping code you may need to make many requests to ensure things are working right. In an ideal world you'd even have a tool like vitest running in watch mode which runs an extensive test suite on code changes so you can get immediate feedback as to whether you've fixed something or not. That's the setup I have in the app I'm using got-scrapping for. However that can mean repeatedly making the same request even though you're expecting the result to be unchanged. Depending on the situation that might not be an issue at all. But if you're dealing with tight api limits it could certainly be an issue.
Nock looks very cool! I hadn't come across it before so thanks for introducing me. I'm not seeing a pre-existing way to use it for automatically caching responses but doesn't seem like it would be too hard to write something similar to @loopback/http-caching-proxy but for nock. I actually have come up with another workaround (since I really wanted transparent caching of requests and came up with this workaround soon after creating this issue). It's ugly and I still think it would be good to have support at the got-scrapping level but it works! (at least for my situation) I've added a small bit of code just before a request is actually passed to got-scraping that rewrites a request like if (cachingProxyPort) {
const targetUrl = new URL(url)
url = `http://localhost:${cachingProxyPort}${targetUrl.pathname}`
additionalHeaders["x-real-origin"] = targetUrl.origin
} got-scrapping will then diligently send it to the first proxy. This proxy takes the request and formats it in the format expected by a HTTP target based proxy (since that's what the caching proxy requires): import http from "http";
import { ProxyServer } from '@refactorjs/http-proxy' // The most popular node http proxy library, and the
// one this is a refactor of, has essentially been unmaintained for the past 4 years and has some
// serious bugs.
const firstProxy = new ProxyServer();
const firstProxyServer = http.createServer(async function(req, res) {
const realOriginHeaderName = 'x-real-origin'
const realOrigin = [req.headers[realOriginHeaderName]].flat()[0] // Header value could be a string or an
// array of strings so unwrap if array
if (!realOrigin) {
throw Error(`Received request without ${realOriginHeaderName} header`)
}
req.headers.host = new URL(realOrigin).hostname
req.url = realOrigin + req.url
firstProxy.web(req, res, { target: secondProxyServerAddress, toProxy: true, prependPath: false });
}).listen(0);
const firstProxyServerAddress = firstProxyServer.address() Finally requests send out by the first proxy are received by the second/caching proxy: import { HttpCachingProxy } from '@loopback/http-caching-proxy';
import path from 'path';
const secondProxyServer = new HttpCachingProxy({
cachePath: path.resolve(__dirname, '.proxy-cache'),
ttl: 24 * 60 * 60 * 1000, // 1 day
});
await secondProxyServer.start();
const secondProxyServerAddress = secondProxyServer.url Again ugly! So I'd love to see support built into got-scraping but if anyone else is interested in this use-case hopefully this helps as a workaround. So far I haven't run into any TLS issues with the sites I'm using but your mileage may vary. |
I see what you are trying to achieve - but keep in mind that if a regular user is requesting a URL with the protocol I understand that most people using We might look into it in the future if we see more push for this feature. For now, I'll have to icebox this, as it seems very niche and potentially problematic. If you find the time, feel free to shoot us a PR - I'm pretty sure it's gonna be as easy as repurposing the |
Makes sense! Thanks for your thoughtful and considered response ❤️ If I end up getting some time to put together a PR I’ll reopen this but in the mean time I’ll close this to reduce noise on your end. Thanks again! |
I'd like to use got-scrapping with @loopback/http-caching-proxy (I have some tests that I want to be able to re-run many times without putting extra load on/getting blocked by the upstream servers I'm hitting). However @loopback/http-caching-proxy only supports the type of proxing where the full url is included in the HTTP request (e.g.
GET https://www.example.com/index.html HTTP/1.1
) rather than using the CONNECT method (e.g.CONNECT www.example.com:443 HTTP/1.1
). Presumably this is because the CONNECT type of proxing will simply forward encrypted HTTPS traffic between the client and the server and thus won't be able to cache it. However it seems like got-scraping doesn't support the non-CONNECT proxy type meaning there's no easy way to use a caching proxy server with it.Example:
The last command prints this:
(501 is the response @loopback/http-caching-proxy gives when it receives a CONNECT request)
So this ticket is a request to support non-CONNECT proxies. Thanks for your consideration and for such a handy library!
The text was updated successfully, but these errors were encountered: