
Failure to schedule task: the container name is already in use by container #24940

Open
gscho opened this issue Jan 24, 2025 · 1 comment
gscho commented Jan 24, 2025

Nomad version

v1.9.5, but this was also happening with v1.9.1. It never happens on our Windows hosts, which are running v1.7.7.

Docker version

Docker version 27.5.1, build 9f9e405

Operating system and Environment details

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

Issue

Jobs fail intermittently with:

failed to create container: Error response from daemon: Conflict. The container name "/build-6cc59855-66d6-6db4-9e87-0e3a6c669e88" is already in use by container "e0c1fd8704656777b88eb06bd01637e348ee1909dcb6c2384a6ad849f8b8b9cb". You have to remove (or rename) that container to be able to reuse that name.

Seems to be the same as an older issue: #2084

Reproduction steps

Unclear. We run hundreds to thousands of batch jobs per day, and an unknown percentage of them fail with this error.


gulducat (Member) commented

Thanks for the report @gscho.

The allocation ID in the container name {task-name}-{allocation-id} should be unique, so the names really should not collide. I suspect it's some retry behavior in our docker plugin.
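
As a sketch of that naming scheme (illustrative only; the real driver builds the name internally):

```go
package sketch

import "fmt"

// containerName illustrates the "{task-name}-{allocation-id}" scheme
// described above; a sketch, not the driver's actual implementation.
func containerName(taskName, allocID string) string {
	return fmt.Sprintf("%s-%s", taskName, allocID)
}

// containerName("build", "6cc59855-66d6-6db4-9e87-0e3a6c669e88") yields the
// name from the reported error, minus the leading "/" Docker prepends.
```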

I'm looking at this section of the driver code: https://github.com/hashicorp/nomad/blob/v1.9.5/drivers/docker/driver.go#L391-L415

At a high level, I think this can happen if (see the sketch after this list):

  1. Nomad docker driver creates the container
  2. but it isn't Running
  3. it doesn't start when the driver asks it to
  4. driver tries to remove the non-starting container (we do not handle or log if this errors, at present)
  5. the start-container error is an ErrConflict (from docker library)
  6. driver tries this all again, from the top
  7. container fails to create, because the name is taken
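
A minimal sketch of that sequence, assuming a hypothetical DockerClient interface whose method names merely stand in for the real docker library calls:

```go
package sketch

import (
	"errors"
	"fmt"
)

// DockerClient is a stand-in for the docker client the driver uses; the
// method names and signatures are illustrative, not the real API.
type DockerClient interface {
	CreateContainer(name string) (id string, err error)
	StartContainer(id string) error
	RemoveContainer(id string) error
}

// errConflict stands in for the docker library's conflict error.
var errConflict = errors.New("conflict: container name already in use")

// createAndStart walks the suspected failure sequence from the list above.
func createAndStart(client DockerClient, name string, maxAttempts int) (string, error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		// Step 1: create the container under the fixed name.
		id, err := client.CreateContainer(name)
		if err != nil {
			// Step 7: on a retry, creation fails because the dead
			// container from the last attempt still holds the name.
			return "", fmt.Errorf("failed to create container: %w", err)
		}
		// Steps 2-3: the container exists but does not start.
		if err := client.StartContainer(id); err != nil {
			// Step 4: the removal error is neither handled nor logged,
			// so a failed removal silently leaves the name taken.
			_ = client.RemoveContainer(id)
			if errors.Is(err, errConflict) {
				// Steps 5-6: on a conflict error, retry from the top.
				continue
			}
			return "", fmt.Errorf("failed to start container: %w", err)
		}
		return id, nil
	}
	return "", errors.New("exhausted container create/start attempts")
}
```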

After that, if your job group is configured to reschedule, then Nomad will place a new allocation, with a new allocation ID, and probably (since you say this is very rare) succeed.

To narrow this down, are you able to check the Nomad client agent logs during one of these occurrences? If the sequence of events I describe above is what's happening, then I would expect to see these logs:

INFO created container container_id={docker container ID}
ERROR failed to start container container_id={} error={some potentially informative error}
DEBUG reattempting container create/start sequence attempt={a number} container_id={}
ERROR failed to create container error={an error like the one you've reported here}

I mentioned in step 4 that we don't log the container removal attempt (line 410). We can add logging for that, which may yield new, helpful logs.
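
A hedged sketch of what that added logging could look like, reusing the hypothetical DockerClient from the sketch above together with Nomad's go-hclog logger:

```go
package sketch

import hclog "github.com/hashicorp/go-hclog"

// cleanupContainer shows where the removal error could be surfaced; the
// call shape is illustrative, not the driver's actual code.
func cleanupContainer(client DockerClient, logger hclog.Logger, id string) {
	if err := client.RemoveContainer(id); err != nil {
		// Logging here would reveal why the name stays taken when the
		// create/start sequence is retried.
		logger.Error("failed to remove container after failed start",
			"container_id", id, "error", err)
	}
}
```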

And if you can acquire agent logs for us, it may help narrow down whether it's something worth retrying, or if something else entirely may be going on.

@gulducat gulducat moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jan 28, 2025