-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Number of open connections resulting in network errors #384
Comments
Many thanks for investigating and reporting @rob-ellison-jet. I’m happy to get your fix into next release - if you’d like to be credited, could you make a PR based on develop, and sign the CLA (if you haven’t signed before a bot will comment on your PR)? Alternatively if you’d prefer we just take it from here, let me know - I can make a quick PR myself. We have a bit of ongoing testing for the other features in 3.1.0 before we make a full release, but I’ll gladly push an rc image to unblock you in the meantime. Thanks again for reporting! |
Hi @colmsnowplow, I've created the PR here: I had some tests too but tried a few things but couldn't get these working. |
Super, thanks so much rob! No hassle on tests - I'll take a look and will give it at least a manual test before I push a release candidate. Hoping to fit that in today if all goes well, bear with me. |
@rob-ellison-jet I have now deployed to the Note that I have manually tested locally only so far, I won’t get to deploying and testing under load today, but I’ll let you know once I’ve got that done. Shouldn't be more than a couple of days at worst. |
Thanks @colmsnowplow, I haven't tested the change to 3.0.0 under load yet. I only run a load test against the patch to 2.3.0. Currently we can't use 3.1.0-rc2 in production. I could run a load test on it in staging though. I was planning to deploy the patch 2.3.0 use until we have license sorted. Are there any considerations for the build environment that we need to take into account? Can you think of anything we might miss? |
Ah, sorry I had totally missed the context of the licence! Apologies, I follow you now.
I recently tried to build and deploy a test asset from local, but ran into issues because my machine is an M1 mac, and I struggled to build the linux-compatible image locally. This is not an insurmountable problem, there’s definitely a solve for it, but I don’t know how hard it is to solve, because I changed tack and just deployed from CI instead. So I’m limited in how helpful I can be in guiding you in taking that approach. But I can suggest two options to make life easier:
One approach you could take, is to populate the dockerhub secrets in your forked repo, and try deploying that way. You’d probably need to comment out the integration test that uses an ngrok token (http target TLS test IIRC) - and probably also the snyk steps that require a token. IMO if it’s workable, this is the best option from your perspective for now, because it gives you the ability to easily tweak stuff and deploy further patches should you need do so as you iterate/test. To trigger the deploy with the GH action file in the repo, you’d just need to make sure VERSION and the version variable in cmd/constants.go match, and then tag with that same version.
The above might be painful/difficult to make work - I’m not sure. If you run into issues, make a PR and I’ll be happy to bring it into a non-prod branch and deploy an image to our dockerhub for you. The idea here would be that it’s just to facilitate unblocking you - because it’s a legacy version and we have our hands fairly full at the moment, I don’t think we’d be able to put time into testing that asset or working on progressing towards a release. (editing to clarify: we'll still progress the v3 changes to release, and test those, regardless. What we'd struggle to fit in is a 2.3.0-compatible legacy release) If I understand things I think you might be ok with this as a temporary measure since you’ve done a fair bit of testing already yourself. — If neither of these work for you let me know.
No need - we need to (and I want to!) regardless - I was just being explicit about what was tested - but I didn’t realise you wouldn’t be able to push v3 to prod. |
If you could go with option 2 then that would be the be the most risk free option. I've had some difficulties in try to get option 1 working fully wil all the e2e and integration tests. This is the Branch I have with the changes: If you could create a branch I could create a PR. |
Hi @colmsnowplow, just wanted to check you saw this. I spend quite some time already trying the first approach. This is proving quite difficult though. I totally get that creating a new release for an old version involves a lot of work. Ignore me I just spotted the image on Dockerhub! |
Hi @colmsnowplow, I tested the change related to this PR wrapping in a function and also including defer on the Close body. This didn't work as optimally as the original solution suggested in Issue description. Really sorry to ask again but would you be able to upload an image from my latest branch change (i will not ask again)? |
Hey @rob-ellison-jet sorry for the radio silence! There have been parallel conversations among the business-side people, and they've let me know that there's now a plan to get you on the path to 3.1.x, which hopefully should be the simpler solution. Of course I'm happy to support further where needed - let's co-ordinate through that channel. Leaving the issue open until we get the fix into full 3.1.0 release |
Hey @rob-ellison-jet, we've been experimenting a bit with HTTP client and came up with these changes #388. Apart from draining/closing response body (in defer function though) it also contains some changes in the HTTP transport configuration. These changes have been released with If you could give it a spin and let us know how it works on your environment - that would be awesome :) |
Thanks @pondzix, I'll test this most likely next week and report back. |
Hey @pondzix @colmsnowplow , It looks like the downstream connections have dramatically increased now based on the rc6 image and setting the We are till seeing quite a high volume of upstream connections though to kinesis that may need to be addressed but this is the source. I'm not sure if you can set similar limits there. / $ netstat -atn | grep :443 | wc -l
3567 It seems that you could also set a custom HTTP Client when creating the AWS session. This might help to limit the number of upstream connections also. httpClient := &http.Client{
Transport: &http.Transport{
MaxIdleConns: 100, // Maximum number of idle connections
IdleConnTimeout: 90 * time.Second,
MaxIdleConnsPerHost: 10, // Maximum idle connections per host
},
}
// Initialize the session with the custom HTTP client
sess = session.Must(session.NewSessionWithOptions(session.Options{
SharedConfigState: session.SharedConfigEnable,
Config: aws.Config{
Region: aws.String(region),
Endpoint: aws.String(endpoint),
HTTPClient: httpClient, // Assign the custom HTTP client
},
}))
// If roleARN is provided, assume the role
if roleARN != "" {
creds := stscreds.NewCredentials(sess, roleARN)
cfg = &aws.Config{
Credentials: creds,
Region: aws.String(region),
HTTPClient: httpClient, // Reuse the same custom HTTP client
}
} |
PR to address this: #391 |
Thanks @rob-ellison-jet , this is interesting idea and we'll definitely take a look at this! Right now we'd like to proceed with 3.1.0 release which would include changes from 3.1.0-rc5, so limiting/pooling downstream connections, without limiting kinesis connections. After 3.1.0 releasing, we will analyze/test your proposal and if it looks good - include in the 3.2.0 release :) |
Nice, thanks @pondzix ! Apologies for hardcoding the connections. I did try making the connections customisable but the problem is that the AWS session is created once. Whereas, in the config there are many configs for both source and target. The config may need to happen at the root level. |
We've experienced issues with the snowbridge errors due to
connection reset by peer
also 404 responses on the http target. The problem is due to the number of open connections on the pods.Source = kinesis
Target = http
The http target in our case is Google Tag Manager Server-side.
Generate sufficient constant load onto the kinesis stream of approximately 4K records per minute. After a few hours you should start to get either connection reset by peer from aws or 404's from GTM.
Check the number of open connections on the box.
We're currently using version 2.3.0. The main culprit appears to be following line in the http target.
Applying the following patch drastically reduces the number of open connections:
In staging we have also tested using the latest version 3.0.0 and have experienced similar issues there also.
The text was updated successfully, but these errors were encountered: