Instance randomly not connecting to SSM #49

Open
michael-kutsch opened this issue Jul 15, 2023 · 15 comments
Labels
bug (Something isn't working), priority: low (Lowest priority issue or pull request)

Comments

@michael-kutsch

michael-kutsch commented Jul 15, 2023

Bug Description

In some cases, the SSM agent takes longer than the default stop-timeout of 5 minutes to connect to SSM. The EC2 instance comes up, but the user cannot create an SSM session before it stops again.
The SSM agent could also have crashed for some reason, which produces the same result: you cannot connect.

Steps to Reproduce

Disclaimer: this is hard to reproduce as it depends on several factors that are out of control of the user.

  1. basti init
  2. basti connect
  3. basti instance does not show up in session manager
  4. connection times out
  5. basti instance stops

Expected Behavior

Basti instance is usable via SSM

Current Behavior

See Steps to Reproduce

Possible Solution (Optional)

I see three options:

  1. increase the default stop-timeout or make it configurable

  2. add an additional reboot to the instance after init (a manual reboot helped in my case)

  3. (preferred option) make the basti instance check whether the SSM agent connected successfully, so that a session can be initialized. If not, (force) restart the SSM agent, wait X s, and recheck. If the basti instance still cannot connect to SSM, reboot it. If the reboot does not help, terminate the instance.
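
The escalation order in option 3 could be sketched roughly as follows. This is only an illustration of the decision order, not Basti's implementation: the function name and all four callables (`is_agent_online`, `restart_agent`, `reboot_instance`, `terminate_instance`) are hypothetical placeholders, and a real on-instance script would have to persist its state across the reboot, which this sketch ignores.

```python
import time
from typing import Callable


def ensure_ssm_connected(
    is_agent_online: Callable[[], bool],
    restart_agent: Callable[[], None],
    reboot_instance: Callable[[], None],
    terminate_instance: Callable[[], None],
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Escalate from agent restart to reboot to termination.

    All callables are hypothetical hooks supplied by the caller.
    Returns the last step that was needed:
    "ok", "agent-restart", "reboot", or "terminated".
    """
    if is_agent_online():
        return "ok"

    # Step 1: (force) restart the SSM agent, wait X s, recheck.
    restart_agent()
    sleep(wait_seconds)
    if is_agent_online():
        return "agent-restart"

    # Step 2: the restart did not help, so reboot the instance and recheck.
    reboot_instance()
    sleep(wait_seconds)
    if is_agent_online():
        return "reboot"

    # Step 3: the reboot did not help either, so give up on this instance.
    terminate_instance()
    return "terminated"
```

Passing the checks and actions in as callables keeps the ordering testable without touching AWS; the value of X (`wait_seconds`) is left open in the issue, so the default here is an arbitrary assumption.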

Related Issues/PRs

none

michael-kutsch added the bug label Jul 15, 2023
@BohdanPetryshyn
Collaborator

BohdanPetryshyn commented Jul 15, 2023

@DrFunk-n-stein, thank you for submitting the issue!

First of all, I'd ask you to make sure you're using the latest version of Basti. Recently, I added additional retries on the client side (in basti connect) which solved the issue of the port forwarding session not starting when the bastion instance is online in SSM.

The problem you described is different from what I recently fixed. To be honest, I haven't noticed such SSM agent behavior even though we use Basti hundreds of times a day. However, this can happen for sure, and I think the best solution would be a variation of solution #3 you suggested.

I think the solution can be slightly simplified by rebooting the instance right after noticing the SSM agent malfunction (skipping the SSM agent restart).

I'd like to ask you if you want to become a contributor and introduce such a health check. This would really help the project🤗 Otherwise, I'll do this as soon as I have some free time.

@BohdanPetryshyn
Collaborator

Regarding the stop timeout: the basti connect command marks the bastion instance as in use as soon as it starts trying to connect to it. So the instance shouldn't stop unless the Basti CLI gave up retrying or was manually stopped.

@michael-kutsch
Author

About becoming a contributor: yes please :D

This is such a common use case, and I like the very simple UX of basti.

@BohdanPetryshyn
Collaborator

It's always so nice when somebody volunteers to become a contributor. Thank you!

When can you start working on this?

@michael-kutsch
Author

Soonish 😅
(Father of two, I'll try to onboard myself tonight, first PR will take a while)

@michael-kutsch
Author

@BohdanPetryshyn can you assign this one to me please? It seems I lack the permissions.

@michael-kutsch
Author

Just confirmed that it also fails in different setups.

The only commonality I could find is that the overall setup uses a central egress pattern: the instances' traffic is routed via a transit gateway to another AWS account, which passes it through central NAT gateways.

Maybe it's a latency or routing thing. Either way, a manual restart of the instance did the trick again.

That is indeed an interesting issue 😅

I will soon add private endpoints for the required services to the setup and test whether that alone resolves it. Even so, the SSM agent is not able to connect, so this is worth fixing.

@BohdanPetryshyn
Collaborator

Hi, @DrFunk-n-stein 👋

Are you still up to implementing the health check yourself?

@michael-kutsch
Author

> Hi, @DrFunk-n-stein 👋
>
> Are you still up to implementing the health check yourself?

Thanks for pinging me - feel free to hand it over to someone else.
My job and private schedule prevented me from putting time into it.
Sorry for that - I will let you know once I'm available.

@BohdanPetryshyn
Collaborator

No problem! Thank you for letting me know!

@bobveringa
Contributor

bobveringa commented Aug 7, 2023

@DrFunk-n-stein Could you maybe provide the basti logs for this? Logs are stored in /var/log/basti/stop-if-not-used.log. Perhaps this is caused by some type of error.

@michael-kutsch
Author

michael-kutsch commented Aug 7, 2023

I can check them the next time it happens.

But I'm 99% certain the cause is that the instance is not connected to Session Manager and shuts off after 5 minutes.
I checked via the console and the awscli that the instance does not show up in Session Manager (I used https://pypi.org/project/aws-ssm-tools/, which is a nice wrapper for SSM commands).

After a reboot, the SSM agent can connect properly most of the time, and then it works.
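
For anyone who wants to script the same check, here is a minimal sketch of asking SSM whether an instance is registered. It is not what was used above (that was the console and aws-ssm-tools); it assumes a boto3 SSM client and its `describe_instance_information` call, and both the helper name `ssm_ping_status` and the `"NotRegistered"` return value are made up for this example.

```python
from typing import Any


def ssm_ping_status(ssm_client: Any, instance_id: str) -> str:
    """Return the SSM PingStatus for an instance.

    `ssm_client` is expected to behave like boto3.client("ssm").
    "NotRegistered" is a label invented for this sketch; it means the
    instance does not show up in SSM at all.
    """
    response = ssm_client.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )
    entries = response.get("InstanceInformationList", [])
    if not entries:
        return "NotRegistered"
    return entries[0]["PingStatus"]


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   print(ssm_ping_status(boto3.client("ssm"), "i-0123456789abcdef0"))
```

An instance that never registered returns an empty list, which is the situation described in this issue; a registered instance reports a status such as `Online`.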

BohdanPetryshyn added the priority: low label Aug 31, 2023
@maartenvanderhoef

The problem is the sheer speed of AWS.

First the role and the instance get created.
The instance ID and role are then used to create granular inline permissions for ec2:CreateTags.

The problem is that when the instance boots quickly, which they do, it starts before the SSM permissions are active. The ssm-agent then wants to wait for 28 minutes or so, and well before that the instance shuts down by design.

It would be better to:

  1. Create the Role
  2. Add the first inline permissions for ssm
  3. Create the instance
  4. Add the second inline permissions for Tagging
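
As a rough sketch of that ordering with boto3-style clients (this is not Basti's actual provisioning code): the function name, the policy names `ssm-access` and `self-tagging`, and every parameter are hypothetical, and the policy documents are passed in by the caller rather than spelled out here. The instance profile step is an addition the list does not mention, included only because an instance needs a profile to use the role.

```python
import json
from typing import Any, Callable, Dict


def provision_bastion(
    iam: Any,
    ec2: Any,
    role_name: str,
    profile_name: str,
    trust_policy: Dict[str, Any],
    ssm_policy: Dict[str, Any],
    build_tagging_policy: Callable[[str], Dict[str, Any]],
    run_instances_args: Dict[str, Any],
) -> str:
    """Create role and SSM permissions first, then the instance, then tagging.

    `iam` and `ec2` are expected to behave like boto3 IAM and EC2 clients.
    """
    # 1. Create the role.
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    # 2. Add the SSM inline permissions before any instance exists.
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="ssm-access",  # hypothetical policy name
        PolicyDocument=json.dumps(ssm_policy),
    )
    # Assumption: wrap the role in an instance profile so EC2 can use it.
    iam.create_instance_profile(InstanceProfileName=profile_name)
    iam.add_role_to_instance_profile(
        InstanceProfileName=profile_name, RoleName=role_name
    )
    # 3. Create the instance; the SSM permissions were already requested.
    response = ec2.run_instances(
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": profile_name},
        **run_instances_args,
    )
    instance_id = response["Instances"][0]["InstanceId"]
    # 4. Add the tagging permissions, which depend on the instance ID.
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="self-tagging",  # hypothetical policy name
        PolicyDocument=json.dumps(build_tagging_policy(instance_id)),
    )
    return instance_id
```

Only the tagging policy needs the instance ID, so it is the only piece that has to wait until after the instance is created.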

@maartenvanderhoef

Or we can use a generic policy that allows an instance to tag itself, as shown here:

https://unixorn.github.io/post/iam-self-tagging/

I can PR if there is support for it.

@BohdanPetryshyn
Collaborator

Hey @maartenvanderhoef, that's a very nice catch! Thank you for figuring this out 👍

I would go with the "iam-self-tagging" approach you mentioned to simplify the code (to still create all the inline policies in one place and with a single log message). I would very much appreciate your help with implementing this stability improvement ❤️
