Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pending allocations stuck in pending state after adoption by a new deployment #23305

Open
lattwood opened this issue Jun 11, 2024 · 22 comments · Fixed by hashicorp/yamux#127
Open

Comments

@lattwood
Copy link
Contributor

lattwood commented Jun 11, 2024

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

All Linux

Issue

  • We have a large Nomad cluster- > 2k agents.
  • We run health checks on each agent and do dynamic metadata updates.
  • We have a large job that gets scheduled on 99% of the healthy agents.
  • We use a shell script to query nomad for node counts matching a filter on relevant dynamic metadata & run nomad job scale JOB TG COUNT when it changes.
  • We have a way to identify if the currently running deployment is due to a new Docker Image, or due to a scaling activity.
  • If it's due to a scaling activity on another task group, we cancel that deployment, and issue a new nomad job scale command immediately afterwards.
    • Due to how frequently the dynamic metadata is changing, we are constantly doing deployments.
    • We kicked off 250 deployments in the last hour.
  • Existing allocations usually all get adopted by the new deployment.
  • The issues:
    • most of the time, but not always, the version on the existing allocs that are still "Pending" at deployment cancelation are updated. but we end up with allocations that don't get the version updated.
    • sometimes the allocation exists, but isn't assigned to a client in the UI. if you inspect the alloc with the Nomad CLI, it tells you which client it has been assigned to.
    • allocations get stuck in the pending state. we've found sending a SIGQUIT to the Nomad Agent fixes this, but that doesn't scale to > 2k nodes.
    • allocations will get rescheduled before they ever leave the pending state. YEAH.
    • possibly related, but might not be- if a node heartbeat is missed and reregistered, we can end up with stuck pending jobs on a node as well. This is also fixed with a SIGQUIT.
      • The stuck jobs aspect is brutal, since that doesn't start the health/deployment timers as far as I can tell.
      • The fact restarting the nomad agent is a little unnerving. I vaguely recall seeing something in the logs about "allocations out of date" or something like that when this issue has manifested itself in the past.

Reproduction steps

  • run a big cluster of nodes with intermittent connectivity issues to the master.
  • run a big long running job on said cluster
  • constantly be deploying/scaling the jobspec
  • wait

Expected Result

  • nomad schedules and runs jobs

Actual Result

  • nomad doesn't

Screenshots of Absurdity

In this screenshot, all the allocs are part of the same job, but different task groups. The colours correspond to the actual task groups. Note the varying versions, and some having client and some not.

Additionally, you can see the Modified time is the same for ones staying up to date, but isn't changing on other ones- you can also see the Created times are all over the place.

Screenshot 2024-06-11 at 3 35 54 PM

In this screenshot, you can see we have a pending allocation that has been rescheduled, and that rescheduled allocation is marked pending as well. And neither of the allocations have been assigned to a client as far as the Nomad WebUI is concerned.

Screenshot 2024-06-11 at 4 52 00 PM

Reporter's speculation

Maybe it has something to do w/ how the allocations are being adopted by the rapid deployments? This definitely reeks of race condition.

Nomad logs

Useless, unfortunately, due to this bug: #22431

@lattwood
Copy link
Contributor Author

lattwood commented Jun 11, 2024

Screenshot 2024-06-11 at 5 44 59 PM

in trying to kick the broken allocs, I've managed to get Nomad to try sending an interrupt to a pending allocation!

edit: had to killall -9 nomad on ca11.

@lattwood
Copy link
Contributor Author

I figured out how to make it worse! If I drain the node and mark it as ineligible, then re-enable eligibility, all of the system jobs end up with additional allocations. The only solution here is restarting/killing the Nomad agent.

Screenshot 2024-06-11 at 6 09 06 PM

@lattwood
Copy link
Contributor Author

Just adding for additional flavour/I found this hilarious-

image

@lattwood
Copy link
Contributor Author

lattwood commented Jun 12, 2024

Pending alloc example with logs from the Nomad Agent. Times are all accurate to each other.

Screenshot 2024-06-12 at 10 54 36 AM
root@HOSTNAME:~# TZ=America/Halifax journalctl --unit nomad --since '14 hours ago' | grep -v '\(runner\)'
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"received EOF, stopping recv loop","@module":"client.driver_mgr.docker.docker_logger.stdio","@timestamp":"2024-06-12T02:28:35.213106Z","driver":"docker","err":"rpc error: code = Unavailable desc = error reading from server: EOF"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.224019Z","driver":"docker","id":"2362096","plugin":"/usr/local/bin/nomad"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.225535Z","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.430987Z","driver":"docker","image_id":"sha256:31d1924d3c7e119a450ce9d0d2fbd417a1aba168781625acbc7bf77b398cee33","references":0}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"(watcher) stopping all views","@module":"agent","@timestamp":"2024-06-12T02:28:35.472850Z"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"stopped container","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.610787Z","container_id":"02ae65871d36198b058a7b98387c5b40bc857e7dfc8eb6bd6de60407e6f28305","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"received EOF, stopping recv loop","@module":"client.driver_mgr.docker.docker_logger.stdio","@timestamp":"2024-06-12T02:28:35.626094Z","driver":"docker","err":"rpc error: code = Unavailable desc = error reading from server: EOF"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"error","@message":"error encountered while scanning stdout","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.631515Z","driver":"docker","error":"read |0: file already closed"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.631600Z","driver":"docker","id":"2361837","plugin":"/usr/local/bin/nomad"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.632337Z","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.654011Z","driver":"docker","image_id":"sha256:6331aa0368327f0829316ca10b45844eb7303caf5f63cc979063df0ef7ffca01","references":0}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2024-06-12T02:28:35.669645Z","alloc_id":"221a2ab1-43ea-22bb-1616-24d9ce047609"}
Jun 11 23:31:40 HOSTNAME nomad[3215]: {"@level":"debug","@message":"cleanup removed downloaded image","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:31:40.406819Z","driver":"docker","image_id":"sha256:31d1924d3c7e119a450ce9d0d2fbd417a1aba168781625acbc7bf77b398cee33"}
Jun 11 23:31:41 HOSTNAME nomad[3215]: {"@level":"debug","@message":"cleanup removed downloaded image","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:31:41.201500Z","driver":"docker","image_id":"sha256:6331aa0368327f0829316ca10b45844eb7303caf5f63cc979063df0ef7ffca01"}

From what I can tell, all the logs from ~23:38:35 are due to the failed allocation, and there is nothing in the agent's logs about the pending allocation.

Generally speaking, forcibly killing Nomad (we have draining enabled) tends to fix these issues, but these state inconsistency bugs are terrifying.

@jrasell
Copy link
Member

jrasell commented Jun 12, 2024

Hi @lattwood and thanks for raising this issue with the detail included. I'll add this to our backlog to investigate further. In the meantime, if you discover any other information or concrete reproduction, please let us know.

@lattwood
Copy link
Contributor Author

More log spelunking- after this shows up in the logs on an agent/client, no further alloc updates occur, and the drain issue with the pending allocs also starts occuring too.

Jun 11 14:50:29 HOSTNAME nomad[2835]: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2024-06-11T14:50:29.411617Z","added":0,"errors":0,"ignored":4,"removed":0,"updated":0}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"yamux: keepalive failed: i/o deadline reached","@module":"client","@timestamp":"2024-06-11T14:50:50.303589Z"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"yamux: Failed to read stream data: read tcp [IPv6_ADDRESS_1]:42188-\u003e[IPv6_ADDRESS_2]:4647: use of closed network connection","@module":"client","@timestamp":"2024-06-11T14:50:50.303834Z"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.304920Z","error":"rpc error: EOF","rpc":"Node.UpdateAlloc","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server which is not safe to automatically retry","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.305307Z","error":"rpc error: EOF","rpc":"Node.UpdateAlloc","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error updating allocations","@module":"client","@timestamp":"2024-06-11T14:50:50.305578Z","error":"rpc error: EOF"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.305401Z","error":"rpc error: EOF","rpc":"Alloc.GetAllocs","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}

@lattwood
Copy link
Contributor Author

lattwood commented Jun 12, 2024

grabbed a goroutine stack dump and found a clue.

the same node is blocked here, and was for 1489 minutes, which ends up being just after the "error performing RPC to server" stuff in the above comment: https://github.com/hashicorp/yamux/blob/6c5a7317d6e3b6e7f85db696d6c83ed353e7cb4c/stream.go#L145-L153

Maybe timeouts aren't getting set on retries? idk.

edit: the deserialization methods for a jobspec are also evident in the stack trace.

Jun 12 15:40:35 HOSTNAME nomad[2835]: goroutine 254 gp=0xc0006888c0 m=nil [select, 1489 minutes]:
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.gopark(0xc0024a8d00?, 0x2?, 0x25?, 0xa9?, 0xc0024a8c94?)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/proc.go:402 +0xce fp=0xc0024a8b38 sp=0xc0024a8b18 pc=0x44630e
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.selectgo(0xc0024a8d00, 0xc0024a8c90, 0xc003ed46f0?, 0x0, 0xc0024a8cd8?, 0x1)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/select.go:327 +0x725 fp=0xc0024a8c58 sp=0xc0024a8b38 pc=0x458385
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/yamux.(*Stream).Read(0xc0032fa000, {0xc002cc1000, 0x1000, 0xc001cc0b00?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/[email protected]/stream.go:145 +0x40e fp=0xc0024a8d50 sp=0xc0024a8c58 pc=0xdb2ace
Jun 12 15:40:35 HOSTNAME nomad[2835]: bufio.(*Reader).Read(0xc002081e00, {0xc001cc0b00, 0xb, 0xdf9db0?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         bufio/bufio.go:241 +0x197 fp=0xc0024a8d88 sp=0xc0024a8d50 pc=0x6ca857
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.decReadFull({0x3970800, 0xc002081e00}, {0xc001cc0b00, 0xb, 0x40})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/[email protected]/codec/decode.go:3086 +0xa3 fp=0xc0024a8dd8 sp=0xc0024a8d88 pc=0xdfd303
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readb(0xc0035bde00, {0xc001cc0b00, 0xb, 0x7ed07119d5b8?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/[email protected]/codec/decode.go:422 +0x4d fp=0xc0024a8e28 sp=0xc0024a8dd8 pc=0xdf1aed
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.(*decReaderSwitch).readbIO(0xc002081e00?, {0xc001cc0b00?, 0x0?, 0x17?})
...
elided for brevity
...
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).rpc(0xc00091f888, {0x3112c54, 0xf}, {0x2f9c6c0, 0xc000485d90}, {0x2c6b860, 0xc0008f1860})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:104 +0x23e fp=0xc0024a9988 sp=0xc0024a9818 pc=0x1dcdb9e
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).RPC(...)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:61
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).rpc(0xc00091f888, {0x3112c54, 0xf}, {0x2f9c6c0, 0xc000485d90}, {0x2c6b860, 0xc0008f1860})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:154 +0xb2d fp=0xc0024a9af8 sp=0xc0024a9988 pc=0x1dce48d
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).RPC(...)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:61
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).watchAllocations(0xc00091f888, 0xc0008f1740)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:2395 +0x96e fp=0xc0024a9fc0 sp=0xc0024a9af8 pc=0x1dacf6e
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).run.gowrap1()
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:1850 +0x25 fp=0xc0024a9fe0 sp=0xc0024a9fc0 pc=0x1da9d65
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.goexit({})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc0024a9fe8 sp=0xc0024a9fe0 pc=0x47ebe1
Jun 12 15:40:35 HOSTNAME nomad[2835]: created by github.com/hashicorp/nomad/client.(*Client).run in goroutine 247
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:1850 +0x86

@lattwood
Copy link
Contributor Author

@jrasell I found the bug, it was in yamux. PR: hashicorp/yamux#127

@lattwood
Copy link
Contributor Author

I'm suspecting this bug is caused by the whole pending allocations

@jrasell
Copy link
Member

jrasell commented Jun 14, 2024

Hi @lattwood, amazing work and thanks so much for digging into this and raising a PR. I'll poke around internally and see who I need to get a reviewers on the PR.

@lattwood
Copy link
Contributor Author

@jrasell good stuff. i'm likely going to be cutting a 1.7.7-dev or 1.7.8 build with the yamux change and rolling it out on our side today, once i change the return to a goto.

@lattwood
Copy link
Contributor Author

Just added some more bug/correctness fixes to the PR.

@lattwood
Copy link
Contributor Author

https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork

I threw a build up there with the yamux fix.

@lattwood
Copy link
Contributor Author

lattwood commented Jun 20, 2024

Confirmed that hashicorp/yamux#127 fixes the issue.

We'll be running a forked Nomad build until there's a release of Nomad with the yamux fix in it.

edit: idk if I'm out of line for saying this, but the yamux fix probably should be backported everywhere it's in use 😅

@schmichael
Copy link
Member

Did you have TLS configured for Nomad's RPC?

@lattwood
Copy link
Contributor Author

@schmichael yup. (our nomad agents are distributed around the world)

schmichael added a commit to schmichael/yamux that referenced this issue Aug 14, 2024
Specifically to debug hashicorp/nomad#23305 but tests should probably
run against multiple net.Conn implementations as yamux is sensitive to
net.Conn behaviors such as concurrent Read|Write|Close and what errors
are returned.
@lattwood
Copy link
Contributor Author

Just saw another issue opened on yamux this week that could be responsible for what we're seeing here as well- hashicorp/yamux#133

schmichael added a commit to schmichael/yamux that referenced this issue Sep 5, 2024
Specifically to debug hashicorp/nomad#23305 but tests should probably
run against multiple net.Conn implementations as yamux is sensitive to
net.Conn behaviors such as concurrent Read|Write|Close and what errors
are returned.
@lattwood
Copy link
Contributor Author

lattwood commented Sep 5, 2024 via email

@schmichael
Copy link
Member

hmm, i don't think so? heh.

🤦 Sorry. Evidently Github treats "Should fix #..." as meaning "Absolutely does fix #..." Reopening.

@schmichael schmichael reopened this Sep 5, 2024
@github-project-automation github-project-automation bot moved this from Done to Needs Triage in Nomad - Community Issues Triage Sep 5, 2024
@Juanadelacuesta Juanadelacuesta moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Sep 6, 2024
schmichael added a commit to hashicorp/yamux that referenced this issue Sep 6, 2024
* test against tls.Conns, not pipes

Specifically to debug hashicorp/nomad#23305 but tests should probably
run against multiple net.Conn implementations as yamux is sensitive to
net.Conn behaviors such as concurrent Read|Write|Close and what errors
are returned.

* locally closed streams can still read

Effectively reverts 912e296

Reasons to revert to locally closed streams being readable:
1. Matches libp2p's yamux fork: https://github.com/libp2p/go-yamux/blob/master/stream.go#L95-L96
2. Both yamux and SPDY make it clear that a locally closed stream cannot send more data.
3. SPDY explicitly supports unidirectional streams where one peer is closed (readonly) from the beginning: https://www.chromium.org/spdy/spdy-protocol/spdy-protocol-draft3-1/#46-unidirectional-streams

* test: fix timing when using tcp

TestSession_PartialReadWindowUpdate asserts that large sends can cause
window updates. When using io.Pipes the server recieves the window
update nearly synchronously during the client's Read call. Using tcp
sockets adds just enough latency to reliably lose the race for observing
the window update on the server.

Added a sleep to ensure the server has time to read and apply the window
update from the client.

I also added a number of comments and non-functional cleanups. The
time.Sleep() is the only functional change.

* test: expand connection types tested

Expanded tests to use TCP and TLS. Sorry for the huge diff, but I think
this makes yamux's test suite much more realistic.

There are quite a few new tools:

- testConnPipe is the original io.Pipe based testing tool. Some tests
  still require it due to the ability to easily pause the data flow.
- testConnTCP and testConnTLS create TCP and TLS connections for yamux
  to use. This introduces more realistic timing issues, buffering, and
  subtle differences in error messages.
- testConnTypes is a helper to run subtests against *all* the above
  connection types as well as ensuring reversing the client/server
  sockets doesn't impact yamux (it didn't!). I didn't convert every test
  to it since it adds some time and noise to test runs.

I also tried to formalize (client, server) as a pattern. There was a mix
of orderings. Those roles are rarely meaningful to yamux, but meaningful
names makes debugging easier than numbering variables.
@lattwood
Copy link
Contributor Author

lattwood commented Sep 9, 2024

In the context of this- #23848

would it be worthwhile for us to try dropping in the libp2p yamux fork

schmichael added a commit that referenced this issue Sep 24, 2024
The main point of this dependency upgrade is to pull in the fixes in
hashicorp/yamux#127 which prevents leaking deadlocked goroutines. It has
been observed to improve the issue in #23305 but does not
seem sufficient to fix it entirely.

Since touching yamux is a rare and scary event, I do **not** intend to
backport this. If we discover the improvements are stable and
significant enough, or if further fixes land in yamux, backporting can
be done at that time.
schmichael added a commit that referenced this issue Sep 24, 2024
The main point of this dependency upgrade is to pull in the fixes in
hashicorp/yamux#127 which prevents leaking deadlocked goroutines. It has
been observed to improve the issue in #23305 but does not
seem sufficient to fix it entirely.

Since touching yamux is a rare and scary event, I do **not** intend to
backport this. If we discover the improvements are stable and
significant enough, or if further fixes land in yamux, backporting can
be done at that time.
@lattwood
Copy link
Contributor Author

lattwood commented Jan 4, 2025

@schmichael

I think the hung Read is a symptom, not the cause. I think the cause is a timed out write that the keepalive interval on the underlying session.

hashicorp/yamux#142

@lattwood
Copy link
Contributor Author

Fix opened- hashicorp/yamux#143

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

6 participants
@schmichael @david-yu @jrasell @lattwood @Juanadelacuesta and others