Nomad client not reporting pending job during GC #24777

Open
EtienneBruines opened this issue Jan 6, 2025 · 3 comments

EtienneBruines (Contributor) commented Jan 6, 2025

Nomad version

Nomad v1.9.4
BuildDate 2024-12-18T15:16:22Z
Revision 5e49fcd+CHANGES

Operating system and Environment details

Ubuntu 22.04.5 LTS on amd64

Issue

Situation:

  • Nomad client is busy doing GC (reason: number of allocations (68) is over the limit (50))
  • Nomad client does not start new allocs because of that

Problems:

  • The Nomad server / scheduler still queues new jobs to this client, even though it is 'too busy' to handle anything new
  • The nomad.client.allocations.pending metric reported by the client stays at 0, even though allocations have been pending on it for over 20 minutes

Reproduction steps

  • Have a client with a lot of GC'able allocs (start a lot of them and set the GC interval to 24h or something; see the config sketch below)
  • Wait for a new job to be scheduled to this client
  • See the alloc status pending
  • See the nomad.client.allocations.pending metric being 0
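
For reference, a minimal client configuration that should make a node GC-bound looks roughly like this (a sketch with illustrative values, not taken from the original report):

```hcl
# Client agent configuration (illustrative values, not from the report)
client {
  enabled = true

  # Keep the per-node allocation limit at the default so GC kicks in quickly.
  gc_max_allocs = 50

  # Push the periodic GC far out so terminal allocations pile up and GC
  # only runs once the limit above is exceeded.
  gc_interval = "24h"
}
```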

Expected Result

Ideally (in this order):

But in any case:

  • The nomad.client.allocations.pending metric should report the pending jobs

Actual Result

The nomad.client.allocations.pending metric is 0
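
For context, allocation metrics such as nomad.client.allocations.pending are only published when allocation metrics are enabled in the agent's telemetry stanza; a minimal sketch of the setup assumed here (not taken from the report):

```hcl
# Agent telemetry stanza assumed for observing the metric
# (a sketch; the actual configuration was not included in the report).
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
  collection_interval        = "1s"
}
```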

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

The client only logs entries like this:

{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.250494Z","alloc_id":"ac8fd9bd-39f9-133f-c1ae-eb45c1ecc275"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.252995Z","alloc_id":"feb5dc4c-a549-7b82-a18e-733acd2a7013","reason":"number of allocations (68) is over the limit (50)"}

After the GC-ing is complete (perhaps 20 minutes or so later), the client starts the alloc and logs entries like:

{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-06T10:23:26.213687Z","alloc_id":"db84c9fb-e9e3-df5e-bc34-42e11f57a32e","failed":false,"msg":"Task received by client","task":"sync","type":"Received"}
EtienneBruines changed the title from "Nomad client not reporting pending job" to "Nomad client not reporting pending job during GC" on Jan 6, 2025
pkazmierczak moved this from Needs Triage to Triaging in Nomad - Community Issues Triage on Jan 8, 2025
pkazmierczak (Contributor) commented:

Hi @EtienneBruines, I'm struggling to reproduce this. As noted in another issue of yours, client GC should remove objects asynchronously and not interfere with the client's ability to place workloads. I'm starting to wonder whether GC is a red herring here and the issue is elsewhere entirely?

Could you give more details about your cluster setup and your workloads?

EtienneBruines (Contributor, Author) commented Jan 14, 2025

Hi @pkazmierczak!

Workloads

We have a mix of three workloads:

  • Long-running services
  • Daily periodic jobs
  • Very frequent periodic jobs (every 1-5 minutes)

On average, no more than one job is scheduled every two minutes.

Cluster setup

We are running three servers in a single region and clients in that same region. Network latency from clients to servers is between 1 and 2 ms.

Our clients have gc_parallel_destroys = 20 to speed up GC.
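
This is set in the client stanza of the agent configuration, roughly like the following sketch (only the GC-related line is from our setup; the rest is assumed):

```hcl
client {
  enabled = true

  # Garbage collect up to 20 allocations in parallel (the default is 2).
  gc_parallel_destroys = 20
}
```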

This setting does not seem to be taking effect, though:

{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:48.088920Z","alloc_id":"afefc59a-54e0-3c7f-c613-3e6dfbff3660","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:48.090911Z","alloc_id":"afefc59a-54e0-3c7f-c613-3e6dfbff3660"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:48.093002Z","alloc_id":"82af5cd9-a4fd-0559-2923-9da198f1902d","reason":"new allocations and over max (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:49.255128Z","alloc_id":"55fc114c-1e89-0982-dc0e-f2db0a0fcc8b","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:49.257872Z","alloc_id":"55fc114c-1e89-0982-dc0e-f2db0a0fcc8b"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:49.268933Z","alloc_id":"3511f2c6-3323-322d-2e4d-2cfe7061c799","reason":"number of allocations (360) is over the limit (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:58.095348Z","alloc_id":"82af5cd9-a4fd-0559-2923-9da198f1902d","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:58.097459Z","alloc_id":"82af5cd9-a4fd-0559-2923-9da198f1902d"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:58.099859Z","alloc_id":"bdd95802-229f-9ee7-0686-0487f0ac52f0","reason":"new allocations and over max (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:32:59.271863Z","alloc_id":"3511f2c6-3323-322d-2e4d-2cfe7061c799","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:32:59.274291Z","alloc_id":"3511f2c6-3323-322d-2e4d-2cfe7061c799"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:32:59.287862Z","alloc_id":"812257cc-618c-fdd6-4e8c-4e364b8aa397","reason":"number of allocations (358) is over the limit (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:33:08.105918Z","alloc_id":"bdd95802-229f-9ee7-0686-0487f0ac52f0","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:33:08.107486Z","alloc_id":"bdd95802-229f-9ee7-0686-0487f0ac52f0"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:33:08.111726Z","alloc_id":"3a65d2c4-b5c9-8993-3539-f6753f600389","reason":"new allocations and over max (50)"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-14T11:33:09.289746Z","alloc_id":"812257cc-618c-fdd6-4e8c-4e364b8aa397","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-14T11:33:09.291815Z","alloc_id":"812257cc-618c-fdd6-4e8c-4e364b8aa397"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-14T11:33:09.301858Z","alloc_id":"9d084b71-12a5-ab26-8436-8ac8df1162d5","reason":"number of allocations (356) is over the limit (50)"}

Every 10 seconds the number of allocations drops by 2 (the default gc_parallel_destroys value, not our configured value of 20).

> I'm starting to wonder whether GC is a red herring here and the issue is elsewhere entirely?

The GC is definitely an issue, probably closely related to #2463.

The issue might be elsewhere; the GC is currently the only thing that 'shows'. As long as the number of allocations is dropping but still above 50, the client will not process any jobs assigned to it, nor will it report them in the metrics.

@pkazmierczak Perhaps some useful background info: #19917

Juanadelacuesta (Member) commented:

Hi @EtienneBruines, I'm unable to reproduce this issue with the current information, but it does seem like there is a problem with GC-ing, as also mentioned in #24778 and #19917. Could you please share as much information as possible to help us see what you are seeing: server and client configuration, as well as job specs? Thank you!
