Instance randomly not connecting to SSM #49
Comments
@DrFunk-n-stein, thank you for submitting the issue! First of all, I'd ask you to make sure you're using the latest version of Basti. Recently, I added additional retries on the client side (in The problem you described is different from what I recently fixed. To be honest, I haven't noticed such SSM agent behavior even though we use Basti hundreds of times a day. However, this can happen for sure, and I think the best solution would be a variation of solution #3 you suggested. I think the solution can be slightly simplified by rebooting the instance right after noticing the SSM agent malfunction (skipping the SSM agent restart). I'd like to ask you if you want to become a contributor and introduce such a health check. This would really help the project 🤗 Otherwise, I'll do this as soon as I have some free time.
Regarding the stop timeout.
About becoming a contributor: yes please :D This is such a common use case and I like the very simple UX of
It's always so nice when somebody volunteers to become a contributor. Thank you! When can you start working on this?
Soonish 😅
@BohdanPetryshyn can you assign this one to me please? I lack permissions as it seems like
Just confirmed that it does not work in different setups. The only similarity I could find is that each setup used a central egress pattern, meaning the instances' traffic is routed via a transit gateway to another AWS account that passes it through central NAT gateways. Maybe it's a latency or routing thing; nonetheless, a manual restart of the instance did the trick again. That is indeed an interesting issue 😅 I will apply private endpoints for the required services to the setup soon and test whether that already resolves it. Either way, the SSM agent is not able to connect, so this is worth fixing.
Hi, @DrFunk-n-stein 👋 Are you still up to implementing the health check yourself?
Thanks for pinging me - feel free to hand it over to someone else.
No problem! Thank you for letting me know!
@DrFunk-n-stein Could you maybe provide the basti logs for this? Logs are stored in
I can check them the next time it happens. But with 99% certainty it's because the instance is not connected to Session Manager and shuts off after 5 minutes. After a reboot, the SSM agent can usually connect properly, and then it works.
The problem is the sheer speed of AWS. First the Role and Instance get created. The problem with that is that when the instance is fast, which they are, it starts before its SSM permissions are active. The ssm-agent then wants to wait for 28 minutes or so, and well before that the instance shuts down by design. Better would be to
Or we can use a generic policy that allows an instance to tag itself, as shown here: https://unixorn.github.io/post/iam-self-tagging/ I can open a PR if there is support for it.
Hey @maartenvanderhoef, that's a very nice catch! Thank you for figuring this out 👍 I would go with the "iam-self-tagging" approach you mentioned to simplify the code (to still create all the inline policies in one place and with a single log message). I would very much appreciate your help with implementing this stability improvement ❤️
Bug Description
In some cases, the SSM agent takes longer than the default stop-timeout of 5 minutes to connect to SSM, so the EC2 instance comes up but the user is not able to create an SSM session.
The SSM agent could also have crashed for some reason, which has the same result: you cannot connect.
Steps to Reproduce
Disclaimer: this is hard to reproduce as it depends on several factors that are out of control of the user.
Expected Behavior
Basti instance is usable via SSM
Current Behavior
See Steps to Reproduce
Possible Solution (Optional)
I see three options:
1. Increase the default stop-timeout or make it configurable.
2. Add an additional reboot of the instance after init (a manual reboot helped in my case).
3. (Preferred option) Make the Basti instance check whether the SSM agent connected successfully so that a session can be initialized. If not, (force) restart the SSM agent, wait X s, and recheck. If the Basti instance still cannot connect to SSM, reboot it. If the reboot does not help, terminate the instance.
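The escalation described in the preferred option can be sketched as a small decision function. This is only an illustration, not Basti's actual code: every name here (`nextAction`, `planRecovery`, `RecoveryLimits`, the action strings) is hypothetical, and the connectivity probe is reduced to a list of booleans so the logic can be checked without AWS access.

```typescript
// Hypothetical sketch of the escalation: restart agent -> reboot -> terminate.
// None of these names come from Basti; they are assumptions for illustration.
type HealthAction =
  | "none"
  | "restart-agent"
  | "reboot-instance"
  | "terminate-instance";

interface RecoveryState {
  agentRestarts: number; // SSM agent restarts attempted so far
  reboots: number; // instance reboots attempted so far
}

interface RecoveryLimits {
  maxAgentRestarts: number; // 0 skips the agent restart and goes straight to reboot
  maxReboots: number;
}

// Picks the next recovery step for a single health check result.
function nextAction(
  connected: boolean,
  state: RecoveryState,
  limits: RecoveryLimits
): HealthAction {
  if (connected) return "none";
  if (state.agentRestarts < limits.maxAgentRestarts) return "restart-agent";
  if (state.reboots < limits.maxReboots) return "reboot-instance";
  return "terminate-instance";
}

// Replays a sequence of probe results (one per "wait X s, recheck" cycle)
// and returns the actions that would be taken, stopping once the agent
// connects or the instance is terminated.
function planRecovery(
  probeResults: boolean[],
  limits: RecoveryLimits
): HealthAction[] {
  const state: RecoveryState = { agentRestarts: 0, reboots: 0 };
  const actions: HealthAction[] = [];
  for (const connected of probeResults) {
    const action = nextAction(connected, state, limits);
    actions.push(action);
    if (action === "none" || action === "terminate-instance") break;
    if (action === "restart-agent") state.agentRestarts += 1;
    if (action === "reboot-instance") state.reboots += 1;
  }
  return actions;
}
```

Setting `maxAgentRestarts` to 0 would give the simplified variant that reboots right away when the agent is found unhealthy. In a real implementation the booleans would come from an actual SSM connectivity check, which is left out here.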
Related Issues/PRs
none