✨ Don't assume KubeadmControlPlane for nodeReuse #1311
Conversation
@hardys: GitHub didn't allow me to request PR reviews from the following users: ipetrov117, tmmorin. Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@hardys 👋🏼 Thanks for the PR.
Some context on why only KCP was assumed: simply, we used KCP everywhere as the ControlPlane provider (in all integration tests and other environments), since it is the de-facto control plane implementation used by CAPI, and no other CP provider options were taken into account at the time. Looking at the currently proposed changes, however, I don't see any problem with them (not a breaking change, and probably not a bug 🐛 but rather 🌱), and they should give us more freedom of choice.
Beyond the current change, there are a few other places where we need to replace KCP/KubeadmControlPlane mentions with something generic, i.e. ControlPlane, to align the docs + e2e tests with this change:
- notes in the current repo: https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/docs/api.md#node-reuse-flow
- the original proposal in the metal3-docs repo: https://github.com/metal3-io/metal3-docs/blob/main/design/cluster-api-provider-metal3/node_reuse.md
- the e2e tests we currently have check the feature logic with KCP, which should be fine since we had to choose some CP provider to test it, and that was KCP. I think it is enough to clarify this with something along the lines of "any other CP provider should also work in e2e; however, we are using KCP for testing the node reuse feature" at the top of the node reuse e2e test description:
  "This test verifies the feature of reusing the same node after upgrading Kubernetes version in KubeadmControlPlane (KCP) and MachineDeployment (MD) nodes."
Additionally, let's run e2e tests on the PR to make sure we are not breaking them (@mboukhalfa or someone else, can you please remind me which test trigger we are using to test the node reuse feature?)
Thanks for the context - since we know there are communities leveraging Metal3 now which do require support for CAPM3 with other control plane providers (for example the Sylva community, who found this issue), we can hopefully take this use-case into account in future 👍
Done
I will propose a PR
Done - perhaps in future we can consider whether there's any way to extend this coverage, but for now we know there will be downstream testing that covers this scenario (such as the CI run by the Sylva community mentioned above)
Thanks for the review @furkatgofurov7 - I think I addressed all your comments 👍
Thanks, looking good from my side except a single small nit:
+1 on extending the coverage and adding upstream tests with other CP provider(s).
@@ -535,7 +535,7 @@ When `spec.nodeReuse` field of metal3MachineTemplate is set to `True`, CAPM3
 Machine controller:

 - Sets `infrastructure.cluster.x-k8s.io/node-reuse` label to the corresponding
-  CAPI object name (`KubeadmControlPlane` or `MachineDeployment`) on the
+  CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
Suggested change:
- CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
+ type of owner object (`ControlPlane` for a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
@tmmorin with ^, we would be changing the explanation of the fundamental logic behind the feature: we are not setting the `infrastructure.cluster.x-k8s.io/node-reuse` label to the corresponding *type* of owner object on the BMH, but to the *name* of that object. For example, if the CP is called `cp-test1`, the machine controller will set the following label on the BMH: `infrastructure.cluster.x-k8s.io/node-reuse: cp-test1`
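To make that concrete, here is a minimal Go sketch of the labelling behaviour described above (the constant and variable names are hypothetical, not the actual CAPM3 code):

```go
package main

import "fmt"

// Assumed label key, as documented above; the constant name is hypothetical.
const nodeReuseLabelName = "infrastructure.cluster.x-k8s.io/node-reuse"

func main() {
	// Hypothetical owner: a control plane object named "cp-test1".
	ownerName := "cp-test1"

	// The label value is the owner object's *name*, not its kind or type.
	bmhLabels := map[string]string{nodeReuseLabelName: ownerName}

	fmt.Println(bmhLabels) // map[infrastructure.cluster.x-k8s.io/node-reuse:cp-test1]
}
```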
Thank you very much for bringing this fix @hardys 👍 As far as I can tell with my limited knowledge of Metal3, it looks good.
Ah yes, indeed. The text could capture the `cp-` or `md-` prefix though:
Suggested change:
- CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
+ CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`), prefixed by `cp-` or `md-` (to distinguish control planes and machine deployments), on the
That is a good point. It seems we have no protection from name collisions currently. I would suggest a separate PR for this though, since it would be more of a change in behavior. We would also need to consider upgrade scenarios if we change it (probably just document that CAPM3 should not be upgraded at the same time as an MD or CP is rolling out).
I think the behaviour in the event of a ControlPlane/MachineDeployment name collision is well defined: the control plane is checked first, so it will take precedence. We could perhaps just document that, to avoid any change in behaviour?
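A minimal Go sketch of that precedence, using a hypothetical helper (this is not the actual CAPM3 implementation): owners in the `controlplane.cluster.x-k8s.io` API group are scanned before any `MachineDeployment`, so on a name collision the control plane wins.

```go
package nodereuse

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// nodeReuseLabelValue is a hypothetical helper illustrating the precedence
// discussed above: control plane owners are checked first, so on a name
// collision the control plane's name is returned, not the MachineDeployment's.
func nodeReuseLabelValue(owners []metav1.OwnerReference) (string, bool) {
	// First pass: any owner in the control plane provider API group wins.
	for _, ref := range owners {
		gv, err := schema.ParseGroupVersion(ref.APIVersion)
		if err == nil && gv.Group == "controlplane.cluster.x-k8s.io" {
			return ref.Name, true
		}
	}
	// Second pass: fall back to a MachineDeployment owner, if any.
	for _, ref := range owners {
		if ref.Kind == "MachineDeployment" {
			return ref.Name, true
		}
	}
	return "", false
}
```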
/test-centos-e2e-integration-main
Looks like it was not triggered at all. I vaguely remember seeing messages about CI problems in the provider channel recently, which might be related. Let's try again: /test-centos-e2e-integration-main
Additionally, it should be: /test-ubuntu-e2e-feature-main
/test-ubuntu-integration-main
Looks like it did not trigger the tests at all, let's try again: /test-ubuntu-e2e-feature-main
Unrelated to these changes: /test-ubuntu-e2e-feature-main
Alright, it is not a flake actually, but a wider issue discussed upstream; xref: containers/podman#20690
This issue looks to be closed now, so trying again: /test-ubuntu-e2e-feature-main
Looks good to me
/approve
/test-ubuntu-e2e-feature-main
I'm not familiar with the details of that issue; is there somewhere I can track it and help collaborate on a fix?
Hi, we have done some analysis and asked the infra provider for the CI (Cleura) to investigate, since the issue is coming from the infra side. The ticket has been escalated there and they are now investigating. It is not currently tracked in any GitHub issue.
/test-centos-e2e-feature-main
2 similar comments
/test-centos-e2e-feature-main
/test-centos-e2e-feature-main
Could I ask you @hardys to please sign off the commits? We plan to enable DCO enforcement (https://github.com/apps/dco).
It's valid to use other `controlplane.cluster.x-k8s.io` control plane provider resources, so remove the explicit check for Kind. We already check the Group, e.g. `controlplane.cluster.x-k8s.io`, which should be sufficient to ensure we're retrieving a control plane resource name.
Fixes: metal3-io#1310
Signed-off-by: Steven <[email protected]>
Sure, rebased and amended 👍
/test-centos-e2e-feature-main
3 similar comments
/test-centos-e2e-feature-main
/test-centos-e2e-feature-main
/test-centos-e2e-feature-main
The ubuntu-based feature test was very close to passing: https://jenkins.nordix.org/job/metal3_capm3_main_e2e_feature_test_ubuntu/83/ . It received a broken pipe after node reuse was performed, the repivot was done, and the cluster was in the deleting state. That is the closest it has gotten, so I think this should be good to go. We can try one last time, but I will anyhow override the centos test.
@kashifest: Overrode contexts on behalf of kashifest: test-centos-e2e-feature-main. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test-centos-e2e-integration-main
/test-centos-e2e-integration-main
In the last run node-reuse passed (https://jenkins.nordix.org/job/metal3_capm3_main_e2e_feature_test_ubuntu/84/), so if we combine the last two runs, we can safely say the feature test is passing. I am not sure I would call it a bug though: when it was implemented as a feature, we assumed we would use only KCP. So IMO we can also consider this an extension of that feature. /lgtm
/override test-ubuntu-e2e-feature-main
@kashifest: Overrode contexts on behalf of kashifest: test-ubuntu-e2e-feature-main. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks a lot @kashifest 👍🏼
+1, @hardys what do you think?
From my perspective an infrastructure provider shouldn't really be coupled to a specific control plane provider, but I'm certainly fine to retitle this if the consensus is to describe it as an enhancement 👍
/retitle ✨ Don't assume KubeadmControlPlane for nodeReuse
/hold cancel
What this PR does / why we need it:
It's valid to use other `controlplane.cluster.x-k8s.io` control plane provider resources, so remove the explicit check for Kind. We already check the Group, e.g. `controlplane.cluster.x-k8s.io`, which should be sufficient to ensure we're retrieving a control plane resource name.
Fixes: #1310
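As a rough before/after sketch of the check described above (assumed function shapes, not the literal diff from this PR), matching on the API group instead of a hard-coded Kind accepts any conforming control plane provider:

```go
package nodereuse

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Before (assumed shape): only the KubeadmControlPlane Kind was accepted.
func isControlPlaneOwnerOld(ref metav1.OwnerReference) bool {
	return ref.Kind == "KubeadmControlPlane"
}

// After (assumed shape): any resource in the controlplane.cluster.x-k8s.io
// API group is accepted, covering other CAPI control plane providers too.
func isControlPlaneOwnerNew(ref metav1.OwnerReference) bool {
	gv, err := schema.ParseGroupVersion(ref.APIVersion)
	return err == nil && gv.Group == "controlplane.cluster.x-k8s.io"
}
```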