Bubbling up relevant error information during site deploy #17

darrendejaeger · 2018-01-31T18:11:28Z

I've encountered a couple of scenarios where the site deploy tooling did not make it clear what went wrong during deployment. One example: I specified a particular package version in the site manifests to be installed on the hosts during deployment. The Genesis host had no issue at all. During the actual site deploy, however, I ran into a failure that was difficult to track down. The MaaS GUI showed the nodes as "Deployed", but they weren't joining my k8s cluster. The MaaS event log for one of the nodes showed the following:

Queried node's BMC - Power state queried: on  Fri, 19 Jan. 2018 17:13:37
Node post-installation failure - 'cloudinit' running modules for config  Fri, 19 Jan. 2018 17:13:32
Node post-installation failure - 'cloudinit' running config-apt-configure with frequency once-per-instance  Fri, 19 Jan. 2018 17:13:24
Node changed status - From 'Deploying' to 'Deployed'  Fri, 19 Jan. 2018 17:13:17

Next, I searched the cloud-init logs on that host (/var/log/cloud-init-output.log, /var/log/cloud-init.log), but came up empty-handed. It wasn't until I examined /var/log/syslog that I found my problem:

17:23:59 promjoin.sh[2230]: + apt-get install -y --no-install-recommends ceph-common=10.2.7-0ubuntu0.16.04.1 curl jq docker-engine=1.13.1-0~ubuntu-xenial socat=1.7.3.1-1
17:23:59 promjoin.sh[2230]: Reading package lists...
17:23:59 promjoin.sh[2230]: Building dependency tree...
17:23:59 promjoin.sh[2230]: Reading state information...
17:23:59 promjoin.sh[2230]: E: Version '10.2.7-0ubuntu0.16.04.1' for 'ceph-common' was not found
17:23:59 promjoin.sh[2230]: ++ date +%s
17:23:59 promjoin.sh[2230]: + now=1516382639
17:23:59 promjoin.sh[2230]: + [[ 1516382639 -gt 1516382635 ]]
17:23:59 promjoin.sh[2230]: + log Failed to install apt packages.
17:23:59 promjoin.sh[2230]: ++ date
17:23:59 promjoin.sh[2230]: + echo Fri Jan 19 17:23:59 UTC 2018 Failed to install apt packages.
17:23:59 promjoin.sh[2230]: Fri Jan 19 17:23:59 UTC 2018 Failed to install apt packages.
17:23:59 promjoin.sh[2230]: + exit 1
17:23:59 systemd[1]: promjoin.service: Main process exited, code=exited, status=1/FAILURE
17:23:59 systemd[1]: promjoin.service: Unit entered failed state.
17:23:59 systemd[1]: promjoin.service: Failed with result 'exit-code'.
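
As a rough sketch of the kind of check that would have surfaced this sooner, the snippet below scans syslog lines for a systemd unit whose main process exited with a failure and pairs it with any apt-style "E: " lines logged earlier by a matching script. The function name `find_failures` is hypothetical, and matching a unit to its script by name prefix (promjoin.service to promjoin.sh) is an assumption based only on the excerpt above, not on how the deploy tooling actually works.

```python
import re

# Patterns modeled on the syslog excerpt above. Real syslog lines also carry
# a date and hostname prefix, which re.search simply skips over.
UNIT_FAILED = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+): Main process exited, .*status=(?P<status>\d+)"
)
TAGGED = re.compile(r"(?P<tag>[\w.-]+)\[\d+\]: (?P<msg>.*)")


def find_failures(lines):
    """Return (unit, exit_status, error_messages) for each failed unit.

    Assumption: a unit's log lines are tagged with a name sharing the unit's
    stem, e.g. promjoin.service logs as promjoin.sh.
    """
    errors_by_tag = {}
    failures = []
    for line in lines:
        failed = UNIT_FAILED.search(line)
        if failed:
            stem = failed.group("unit").rsplit(".", 1)[0]
            related = [
                msg
                for tag, msgs in errors_by_tag.items()
                if tag.startswith(stem)
                for msg in msgs
            ]
            failures.append(
                (failed.group("unit"), int(failed.group("status")), related)
            )
            continue
        tagged = TAGGED.search(line)
        # apt prefixes its error lines with "E: ", as seen in the excerpt.
        if tagged and tagged.group("msg").startswith("E: "):
            errors_by_tag.setdefault(tagged.group("tag"), []).append(
                tagged.group("msg")
            )
    return failures
```

Run against the host's log with something like `find_failures(open("/var/log/syslog"))`. Fed the excerpt above, it should report promjoin.service with exit status 1 alongside the ceph-common version error.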

Would it be possible, in some fashion, to make it easier to determine the root cause of these kinds of failures?
