Instance randomly not connecting to SSM #49

michael-kutsch · 2023-07-15T07:47:22Z

Bug Description

In some cases, the SSM agent takes longer than the default stop timeout of 5 minutes to connect to SSM. The EC2 instance comes up, but the user cannot create an SSM session.
The SSM agent may also have crashed for some reason, which has the same result: you cannot connect.

Steps to Reproduce

Disclaimer: this is hard to reproduce because it depends on several factors outside the user's control.

basti init
basti connect
basti instance does not show up in session manager
connection times out
basti instance stops

Expected Behavior

Basti instance is usable via SSM

Current Behavior

See Steps to Reproduce

Possible Solution (Optional)

I see three options:

increase the default stop timeout or make it configurable
add an additional reboot of the instance after init (a manual reboot helped in my case)
(preferred option) make the basti instance check whether the SSM agent connected successfully so that a session can be initialized. If not, (force) restart the SSM agent, wait X s, and recheck. If the basti instance still cannot connect to SSM, reboot it. If the reboot does not help, terminate the instance.
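
The escalation in the preferred option could be sketched roughly as follows. This is only an illustration, not existing Basti code: the function name, the callables it receives, and the default values for the wait time and the number of agent restarts are all assumptions. A real implementation would also have to resume the check after the reboot, which this sketch ignores.

```python
from typing import Callable


def ensure_ssm_connected(
    is_connected: Callable[[], bool],
    restart_agent: Callable[[], None],
    reboot_instance: Callable[[], None],
    terminate_instance: Callable[[], None],
    wait: Callable[[float], None],
    wait_seconds: float = 30.0,  # the "X s" from the proposal; value is a guess
    agent_restarts: int = 2,     # hypothetical number of agent restart attempts
) -> str:
    """Escalate from agent restart to reboot to termination.

    Returns a short label describing which step brought the agent online,
    or "terminated" if nothing helped.
    """
    if is_connected():
        return "connected"

    # Step 1: restart the SSM agent, wait, recheck.
    for _ in range(agent_restarts):
        restart_agent()
        wait(wait_seconds)
        if is_connected():
            return "connected-after-agent-restart"

    # Step 2: reboot the instance and recheck once.
    reboot_instance()
    wait(wait_seconds)
    if is_connected():
        return "connected-after-reboot"

    # Step 3: give up and terminate the instance.
    terminate_instance()
    return "terminated"
```

All side effects are passed in as callables so the decision logic can be exercised without touching a real instance.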

Related Issues/PRs

none


BohdanPetryshyn · 2023-07-15T09:37:00Z

@DrFunk-n-stein, thank you for submitting the issue!

First of all, I'd ask you to make sure you're using the latest version of Basti. Recently, I added additional retries on the client side (in basti connect) which solved the issue of the port forwarding session not starting when the bastion instance is online in SSM.

The problem you described is different from what I recently fixed. To be honest, I haven't noticed such SSM agent behavior even though we use Basti hundreds of times a day. However, this can happen for sure, and I think the best solution would be a variation of solution #3 you suggested.

I think the solution can be slightly simplified by rebooting the instance right after noticing the SSM agent malfunction (skipping the SSM agent restart).

I'd like to ask you if you want to become a contributor and introduce such a health check. This would really help the project🤗 Otherwise, I'll do this as soon as I have some free time.

BohdanPetryshyn · 2023-07-15T09:39:23Z

Regarding the stop timeout: the basti connect command starts marking the bastion instance as in use as soon as it begins trying to connect to it. So the instance shouldn't stop unless the Basti CLI gave up retrying or was manually stopped.

michael-kutsch · 2023-07-15T09:58:40Z

About becoming a contributor: yes please :D

This is such a common use case, and I like the very simple UX of basti.

BohdanPetryshyn · 2023-07-15T10:21:32Z

It's always so nice when somebody volunteers to become a contributor. Thank you!

When can you start working on this?

michael-kutsch · 2023-07-15T10:53:30Z

Soonish 😅
(Father of two, I'll try to onboard myself tonight, first PR will take a while)

michael-kutsch · 2023-07-15T18:31:30Z

@BohdanPetryshyn can you assign this one to me please? It seems I lack the permissions.

michael-kutsch · 2023-07-16T10:10:14Z

Just confirmed that it fails in several different setups.

The only thing the setups have in common, as far as I could find, is a central egress pattern: the instances' traffic is routed via a transit gateway to another AWS account, which passes the traffic through central NAT gateways.

Maybe it's a latency or routing thing. Either way, a manual restart of the instance did the trick again.

That is indeed an interesting issue 😅

I will add private endpoints for the required services to the setup soon and test whether that alone resolves it. Either way, the SSM agent is not able to connect, so this is worth fixing.

BohdanPetryshyn · 2023-07-28T07:44:04Z

Hi, @DrFunk-n-stein 👋

Are you still up to implementing the health check yourself?

michael-kutsch · 2023-07-28T11:02:04Z

> Hi, @DrFunk-n-stein 👋
>
> Are you still up to implementing the health check yourself?

Thanks for pinging me - feel free to hand it over to someone else.
My job and private schedule have kept me from putting time into it.
Sorry for that - I will let you know once I'm available.

BohdanPetryshyn · 2023-08-02T15:37:56Z

No problem! Thank you for letting me know!

bobveringa · 2023-08-07T09:48:50Z

@DrFunk-n-stein Could you maybe provide the basti logs for this? Logs are stored in /var/log/basti/stop-if-not-used.log. Perhaps this is caused by some type of error.

michael-kutsch · 2023-08-07T10:29:51Z

I can check them the next time it happens.

But I'm 99% certain it's because the instance is not connected to Session Manager and shuts off after 5 minutes.
I checked via the console and awscli that the instance is not showing up in Session Manager (I used https://pypi.org/project/aws-ssm-tools/, which is a nice wrapper for SSM commands).

After a reboot, the SSM agent can connect properly most of the time, and then it works.
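
For reference, the "is the instance showing up in Session Manager" check can also be scripted. Below is a minimal sketch that assumes a boto3 SSM client is passed in; the helper name is made up for illustration. It relies on the `describe_instance_information` call with the `InstanceIds` filter and treats the instance as registered when `PingStatus` is `Online`.

```python
from typing import Any


def is_online_in_ssm(ssm_client: Any, instance_id: str) -> bool:
    """Return True if the instance is registered in SSM and reports Online.

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    response = ssm_client.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )
    # An instance that never registered is simply absent from the list.
    return any(
        info.get("PingStatus") == "Online"
        for info in response.get("InstanceInformationList", [])
    )
```

With boto3 installed and credentials configured, this would be called as `is_online_in_ssm(boto3.client("ssm"), "<instance id>")`.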

maartenvanderhoef · 2024-07-16T12:18:10Z

The problem is the sheer speed of AWS.

First, the Role and the Instance get created.
Then the instance ID and the Role are used to create granular inline permissions for ec2:CreateTags.

The problem is that instances start quickly, so the instance boots before its SSM permissions are active. The ssm-agent then waits for 28 minutes or so, and well before that the instance shuts down by design.

A better order would be:

Create the Role
Add the first inline permissions for ssm
Create the instance
Add the second inline permissions for Tagging
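
The proposed ordering can be sketched like this. It is an illustration only: the `Provisioner` interface and its method names are hypothetical and do not correspond to Basti's code or to AWS SDK calls. The point is just that the SSM permissions are in place before the instance boots, and the tagging policy, which needs the instance ID, comes last.

```python
from typing import List, Protocol


class Provisioner(Protocol):
    """Hypothetical interface; a real one would wrap the AWS SDK."""

    def create_role(self) -> str: ...
    def attach_ssm_policy(self, role: str) -> None: ...
    def create_instance(self, role: str) -> str: ...
    def attach_tagging_policy(self, role: str, instance_id: str) -> None: ...


def provision_bastion(provisioner: Provisioner) -> str:
    """Create the bastion resources in the order proposed above."""
    role = provisioner.create_role()
    # SSM permissions exist before the instance (and its agent) starts.
    provisioner.attach_ssm_policy(role)
    instance_id = provisioner.create_instance(role)
    # The tagging policy references the instance ID, so it has to come last.
    provisioner.attach_tagging_policy(role, instance_id)
    return instance_id


class RecordingProvisioner:
    """In-memory stand-in that only records the order of the calls."""

    def __init__(self) -> None:
        self.steps: List[str] = []

    def create_role(self) -> str:
        self.steps.append("create_role")
        return "example-role"

    def attach_ssm_policy(self, role: str) -> None:
        self.steps.append("attach_ssm_policy")

    def create_instance(self, role: str) -> str:
        self.steps.append("create_instance")
        return "example-instance"

    def attach_tagging_policy(self, role: str, instance_id: str) -> None:
        self.steps.append("attach_tagging_policy")
```

Running `provision_bastion(RecordingProvisioner())` and inspecting `steps` shows the SSM policy being attached before the instance is created.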

maartenvanderhoef · 2024-07-16T13:00:22Z

Or we can use a generic policy that allows an instance to tag itself, as shown here:

https://unixorn.github.io/post/iam-self-tagging/

I can PR if there is support for it.

BohdanPetryshyn · 2024-07-22T19:43:50Z

Hey @maartenvanderhoef, that's a very nice catch! Thank you for figuring this out 👍

I would go with the "iam-self-tagging" approach you mentioned to simplify the code (to still create all the inline policies in one place and with a single log message). I would very much appreciate your help with implementing this stability improvement ❤️

michael-kutsch added the bug Something isn't working label Jul 15, 2023

BohdanPetryshyn assigned michael-kutsch Jul 15, 2023

BohdanPetryshyn unassigned michael-kutsch Aug 2, 2023

BohdanPetryshyn added the priority: low Lowest priority issue or pull request label Aug 31, 2023
