#### Nomad version

v1.9.5, but it was also happening with v1.9.1. This never happens on Windows, and those hosts are running v1.7.7.

#### Docker version

Docker version 27.5.1, build 9f9e405

#### Operating system and Environment details

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

#### Issue

Jobs fail intermittently with:

```
failed to create container: Error response from daemon: Conflict. The container name "/build-6cc59855-66d6-6db4-9e87-0e3a6c669e88" is already in use by container "e0c1fd8704656777b88eb06bd01637e348ee1909dcb6c2384a6ad849f8b8b9cb". You have to remove (or rename) that container to be able to reuse that name.
```
The allocation ID in the container name ({task-name}-{allocation-id}) should be effectively unique, so it really should not collide. I suspect it's some retry behavior in our docker plugin:
1. the driver creates the container successfully
2. the driver tries to start the container
3. the start attempt fails
4. the driver tries to remove the non-starting container (we do not handle or log it if this errors, at present)
5. the start-container error is an ErrConflict (from the docker library)
6. the driver tries this all again, from the top
7. the container fails to create, because the name is taken
After that, if your job group is configured to reschedule, Nomad will place a new allocation with a new allocation ID, which (since you say this is very rare) will probably succeed.
To narrow this down, are you able to check the Nomad client agent logs during one of these occurrences? If the sequence of events I describe above is what's happening, then I would expect to see these logs:
```
INFO  created container container_id={docker container ID}
ERROR failed to start container container_id={} error={some potentially informative error}
DEBUG reattempting container create/start sequence attempt={a number} container_id={}
ERROR failed to create container error={an error like the one you've reported here}
```
I mentioned in step 4 that we don't log the container removal attempt (line 410). We can add logging for that, which may yield new, helpful logs.
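For that removal step, the missing log line could look something like the sketch below. `removeContainer` and the message text are assumptions for illustration, and the real driver uses its own structured logger rather than stdlib `slog`:

```go
package main

import (
	"errors"
	"log/slog"
	"os"
)

// removeContainer is a hypothetical stand-in for the Docker API call whose
// error is currently dropped without being handled or logged.
func removeContainer(id string) error {
	return errors.New("no such container")
}

func main() {
	// slog's TextHandler produces key=value output similar in shape to the
	// agent log lines quoted above.
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
	id := "e0c1fd870465"
	if err := removeContainer(id); err != nil {
		// Proposed new log line: surfaces why the container name may
		// still be taken when the create/start sequence is retried.
		logger.Error("failed to remove container after start error",
			"container_id", id, "error", err)
	}
}
```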
And if you can acquire agent logs for us, they may help narrow down whether this is something worth retrying, or whether something else entirely is going on.
Seems to be the same as an older issue: #2084

#### Reproduction steps

Unclear. We run hundreds or thousands of batch jobs per day, and an unknown percentage of them fail with this error.