Nomad client not reporting pending job during GC #24777

Open
EtienneBruines opened this issue Jan 6, 2025 · 3 comments

EtienneBruines (Contributor) commented Jan 6, 2025

Nomad version

Nomad v1.9.4
BuildDate 2024-12-18T15:16:22Z
Revision 5e49fcd+CHANGES

Operating system and Environment details

Ubuntu 22.04.5 LTS on amd64

Issue

Situation:

  • Nomad client is busy doing GC (reason: number of allocations (68) is over the limit (50))
  • Nomad client does not start new allocs because of that

Problems:

  • The Nomad server / scheduler still queues new jobs to this client, even though it is 'too busy' to handle anything new
  • The nomad.client.allocations.pending metric reported by the client stays at 0, even though allocations have been pending on it for over 20 minutes

Reproduction steps

  • Have a client with a lot of GC'able allocs (start a lot of them and set the GC interval to 24h or something; see the config sketch below)
  • Wait for a new job to be scheduled to this client
  • See the alloc status pending
  • See the nomad.client.allocations.pending metric being 0
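
For reference, a minimal client configuration that should make a node GC-bound looks roughly like this (a sketch with illustrative values, not taken from the original report):

```hcl
# Client agent configuration (illustrative values, not from the report)
client {
  enabled = true

  # Keep the per-node allocation limit at the default so GC kicks in quickly.
  gc_max_allocs = 50

  # Push the periodic GC far out so terminal allocations pile up and GC
  # only runs once the limit above is exceeded.
  gc_interval = "24h"
}
```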

Expected Result

Ideally (in this order):

But in any case:

  • The nomad.client.allocations.pending metric should report the pending jobs

Actual Result

The nomad.client.allocations.pending metric is 0
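
For context, allocation metrics such as nomad.client.allocations.pending are only published when allocation metrics are enabled in the agent's telemetry stanza; a minimal sketch of the setup assumed here (not taken from the report):

```hcl
# Agent telemetry stanza assumed for observing the metric
# (a sketch; the actual configuration was not included in the report).
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
  collection_interval        = "1s"
}
```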

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

The client only logs entries like this:

{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.250494Z","alloc_id":"ac8fd9bd-39f9-133f-c1ae-eb45c1ecc275"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.252995Z","alloc_id":"feb5dc4c-a549-7b82-a18e-733acd2a7013","reason":"number of allocations (68) is over the limit (50)"}

After the GC-ing is complete (perhaps 20 minutes or so later), the client starts the alloc and logs entries like:

{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-06T10:23:26.213687Z","alloc_id":"db84c9fb-e9e3-df5e-bc34-42e11f57a32e","failed":false,"msg":"Task received by client","task":"sync","type":"Received"}
EtienneBruines changed the title from "Nomad client not reporting pending job" to "Nomad client not reporting pending job during GC" on Jan 6, 2025
pkazmierczak moved this from Needs Triage to Triaging in Nomad - Community Issues Triage on Jan 8, 2025
pkazmierczak (Contributor) commented:

Hi @EtienneBruines, I'm struggling to reproduce this. As noted in another issue of yours, client GC should remove objects asynchronously and not interfere with the client's ability to place workloads. I'm starting to wonder whether GC is a red herring here and the issue is elsewhere entirely?

Could you give more details about your cluster setup and your workloads?

EtienneBruines (Contributor, Author) commented Jan 14, 2025

Hi @pkazmierczak!

Workloads

We have a mix of three workloads:

  • Long-running services
  • Daily periodic jobs
  • Very frequent periodic jobs (every 1-5 minutes)

On average, no more than one job is scheduled every two minutes.

Cluster setup

We are running three servers in a single region and clients in that same region. Network latency from clients to servers is between 1 and 2 ms.

Our clients have gc_parallel_destroys = 20 to speed up GC.
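
This is set in the client stanza of the agent configuration, roughly like the following sketch (only the GC-related line is from our setup; the rest is assumed):

```hcl
client {
  enabled = true

  # Garbage collect up to 20 allocations in parallel (the default is 2).
  gc_parallel_destroys = 20
}
```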

This setting does not seem to be taking effect, though:

{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:48.088920Z","alloc_id":"afefc59a-54e0-3c7f-c613-3e6dfbff3660","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:48.090911Z","alloc_id":"afefc59a-54e0-3c7f-c613-3e6dfbff3660"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:48.093002Z","alloc_id":"82af5cd9-a4fd-0559-2923-9da198f1902d","reason":"new allocations and over max (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:49.255128Z","alloc_id":"55fc114c-1e89-0982-dc0e-f2db0a0fcc8b","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:49.257872Z","alloc_id":"55fc114c-1e89-0982-dc0e-f2db0a0fcc8b"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:49.268933Z","alloc_id":"3511f2c6-3323-322d-2e4d-2cfe7061c799","reason":"number of allocations (360) is over the limit (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:58.095348Z","alloc_id":"82af5cd9-a4fd-0559-2923-9da198f1902d","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:58.097459Z","alloc_id":"82af5cd9-a4fd-0559-2923-9da198f1902d"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:58.099859Z","alloc_id":"bdd95802-229f-9ee7-0686-0487f0ac52f0","reason":"new allocations and over max (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:59.271863Z","alloc_id":"3511f2c6-3323-322d-2e4d-2cfe7061c799","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:59.274291Z","alloc_id":"3511f2c6-3323-322d-2e4d-2cfe7061c799"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:59.287862Z","alloc_id":"812257cc-618c-fdd6-4e8c-4e364b8aa397","reason":"number of allocations (358) is over the limit (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:33:08.105918Z","alloc_id":"bdd95802-229f-9ee7-0686-0487f0ac52f0","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:33:08.107486Z","alloc_id":"bdd95802-229f-9ee7-0686-0487f0ac52f0"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:33:08.111726Z","alloc_id":"3a65d2c4-b5c9-8993-3539-f6753f600389","reason":"new allocations and over max (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:33:09.289746Z","alloc_id":"812257cc-618c-fdd6-4e8c-4e364b8aa397","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:33:09.291815Z","alloc_id":"812257cc-618c-fdd6-4e8c-4e364b8aa397"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:33:09.301858Z","alloc_id":"9d084b71-12a5-ab26-8436-8ac8df1162d5","reason":"number of allocations (356) is over the limit (50)"}

Every 10 seconds the number of allocations drops by 2 (the default gc_parallel_destroys value, not our configured value of 20).

> I'm starting to wonder whether GC is a red herring here and the issue is elsewhere entirely?

The GC is definitely an issue, probably closely related to #2463.

The issue might be elsewhere; the GC is currently the only thing that 'shows'. As long as the number of allocations is dropping but still above 50, the client will not process any jobs assigned to it, nor will it report them in the metrics.

@pkazmierczak Perhaps some useful background info: #19917

Juanadelacuesta (Member) commented:

Hi @EtienneBruines, I'm unable to reproduce this issue with the current information, but it does seem like there is a problem with GC-ing, as also mentioned in #24778 and #19917. Could you please share as much information as possible to help us see what you are seeing: server and client configuration, as well as job specs? Thank you!
