Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Don't assume KubeadmControlPlane for nodeReuse #1311

Merged
merged 1 commit into from
Dec 12, 2023

Conversation

hardys
Copy link
Member

@hardys hardys commented Nov 9, 2023

It's valid to use other controlplane.cluster.x-k8s.io controlplane provider resources, so remove the explicit check for Kind

We already check the Group e.g controlplane.cluster.x-k8s.io which should be sufficient to ensure we're retrieving a controlplane resource name.

What this PR does / why we need it:

Fixes: #1310

@metal3-io-bot metal3-io-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 9, 2023
@hardys
Copy link
Member Author

hardys commented Nov 9, 2023

@metal3-io-bot
Copy link
Contributor

@hardys: GitHub didn't allow me to request PR reviews from the following users: ipetrov117, tmmorin.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @furkatgofurov7 @ipetrov117 @tmmorin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@furkatgofurov7 furkatgofurov7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hardys 👋🏼 Thanks for the PR.

Some context on why only KCP was assumed: that is simply, we only used KCP everywhere as the ControlPlane provider (in all integration tests and other environments) since it is a de-facto control plane implementation used by CAPI and no other cp provider options were taken into the account by then. However, looking into the current proposed changes, I don't see any problems with it (not a breaking change, also probably not a bug 🐛 but rather 🌱 ) and it should give us more freedom of choice.

Except for the current state, there are few other places we need to make changes (replacing KCP/KubeadmControlPlane mentions with something generic, i.e ControlPlane) to align the docs + e2e tests with the current change:

  1. notes in the current repo: https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/docs/api.md#node-reuse-flow
  2. the original proposal in the metal3-docs repo: https://github.com/metal3-io/metal3-docs/blob/main/design/cluster-api-provider-metal3/node_reuse.md
  3. e2e tests we currently have, checks the feature logic with KCP which should be fine since we had to choose any CP provider to test it and that was KCP in the current e2e tests. I think it should be enough if we clarify that with something along the lines: "any other CP providers should also work in e2e, however, we are using KCP for testing node reuse feature in e2e" somewhere at the the top of node reuse e2e test description in
    * This test verifies the feature of reusing the same node after upgrading Kubernetes version in KubeadmControlPlane (KCP) and MachineDeployment (MD) nodes.

Additionally, let's run e2e tests on the PR to make sure we are not breaking them (@mboukhalfa or someone else can you please remind me what test trigger we are using to test the node reuse feature?)

baremetal/metal3machine_manager.go Outdated Show resolved Hide resolved
baremetal/metal3machine_manager_test.go Outdated Show resolved Hide resolved
baremetal/metal3machine_manager_test.go Show resolved Hide resolved
baremetal/metal3machine_manager.go Show resolved Hide resolved
@metal3-io-bot metal3-io-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 10, 2023
@hardys
Copy link
Member Author

hardys commented Nov 10, 2023

Some context on why only KCP was assumed: that is simply, we only used KCP everywhere as the ControlPlane provider (in all integration tests and other environments) since it is a de-facto control plane implementation used by CAPI and no other cp provider options were taken into the account by then. However, looking into the current proposed changes, I don't see any problems with it (not a breaking change, also probably not a bug 🐛 but rather 🌱 ) and it should give us more freedom of choice.

Thanks for the context - since we know there are communities leveraging Metal3 now which do require support for CAPM3 with other controplane providers (for example the Sylva community who found this issue) we can hopefully take this use-case into account in future 👍

Except for the current state, there are few other places we need to make changes (replacing KCP/KubeadmControlPlane mentions with something generic, i.e ControlPlane) to align the docs + e2e tests with the current change:

  1. notes in the current repo: https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/docs/api.md#node-reuse-flow

Done

  1. the original proposal in the metal3-docs repo: https://github.com/metal3-io/metal3-docs/blob/main/design/cluster-api-provider-metal3/node_reuse.md

I will propose a PR

  1. e2e tests we currently have, checks the feature logic with KCP which should be fine since we had to choose any CP provider to test it and that was KCP in the current e2e tests. I think it should be enough if we clarify that with something along the lines: "any other CP providers should also work in e2e, however, we are using KCP for testing node reuse feature in e2e" somewhere at the the top of node reuse e2e test description in
    * This test verifies the feature of reusing the same node after upgrading Kubernetes version in KubeadmControlPlane (KCP) and MachineDeployment (MD) nodes.

Done - perhaps in future we can consider if there's any way to extend this coverage, but for now we know there will be downstream testing that covers this scenario (such as in the CI run by the Sylva community mentioned above)

@hardys
Copy link
Member Author

hardys commented Nov 10, 2023

Thanks for the review @furkatgofurov7 - I think I addressed all your comments 👍

Copy link
Member

@furkatgofurov7 furkatgofurov7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looking good from my side except a single small nit:

baremetal/metal3machine_manager_test.go Outdated Show resolved Hide resolved
@furkatgofurov7
Copy link
Member

Done - perhaps in future we can consider if there's any way to extend this coverage, but for now we know there will be downstream testing that covers this scenario (such as in the CI run by the Sylva community mentioned above)

+1 on extending the coverage and adding upstream tests with other CP provider(s).

@@ -535,7 +535,7 @@ When `spec.nodeReuse` field of metal3MachineTemplate is set to `True`, CAPM3
Machine controller:

- Sets `infrastructure.cluster.x-k8s.io/node-reuse` label to the corresponding
CAPI object name (`KubeadmControlPlane` or `MachineDeployment`) on the
CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
Copy link
Contributor

@tmmorin tmmorin Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
type of owner object (`ControlPlane` for a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the

Copy link
Member

@furkatgofurov7 furkatgofurov7 Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmmorin with ^, we are changing the explanation of fundamental logic behind the feature, we are not setting infrastructure.cluster.x-k8s.io/node-reuse label to the corresponding type of owner object on BMH, but the name of that object. For example, if CP name is called cp-test1, machine controller will set following label on BMH: infrastructure.cluster.x-k8s.io/node-reuse: cp-test1

@tmmorin
Copy link
Contributor

tmmorin commented Nov 10, 2023

Thank you very much for bringing this fix @hardys 👍

As far as I can tell with my limited knowledge of Metal3, it looks good.

Comment on lines 537 to +538
- Sets `infrastructure.cluster.x-k8s.io/node-reuse` label to the corresponding
CAPI object name (`KubeadmControlPlane` or `MachineDeployment`) on the
CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
Copy link
Member

@furkatgofurov7 furkatgofurov7 Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmmorin with ^, we are changing the explanation of fundamental logic behind the feature, we are not setting infrastructure.cluster.x-k8s.io/node-reuse label to the corresponding type of owner object on BMH, but the name of that object. For example, if CP name is called cp-test1, machine controller will set following label on BMH: infrastructure.cluster.x-k8s.io/node-reuse: cp-test1

@@ -535,7 +535,7 @@ When `spec.nodeReuse` field of metal3MachineTemplate is set to `True`, CAPM3
Machine controller:

- Sets `infrastructure.cluster.x-k8s.io/node-reuse` label to the corresponding
CAPI object name (`KubeadmControlPlane` or `MachineDeployment`) on the
CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, indeed

the text could capture the cp- or md- prefix though

Suggested change
CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`), prefixed by `cp-` or`md-` (to distinguish control plane and machine deployments), on the

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. It seems we have no protection from name collisions currently. I would suggest a separate PR for this though since it would be more of a change in behavior. We will need to consider upgrade scenarios also if we change it (probably just document that CAPM3 should not be upgraded at the same time as the MD or CP is rolling out).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the behaviour in the event of a ControlPlane/MachineDeployment name collision is well defined, the controlplane is checked first so will take precedence, we could perhaps just document that to avoid any change in behaviour?

@hardys
Copy link
Member Author

hardys commented Nov 12, 2023

/test-centos-e2e-integration-main

@furkatgofurov7
Copy link
Member

furkatgofurov7 commented Nov 13, 2023

test-centos-e2e-integration-main

looks like it was not triggered at all, I vaguely remember seeing messages about CI problems in the provider channel recently which might be related. Let's try again:

/test-centos-e2e-integration-main

@furkatgofurov7
Copy link
Member

Additionally, let's run e2e tests on the PR to make sure we are not breaking them (@mboukhalfa or someone else can you please remind me what test trigger we are using to test the node reuse feature?)

Additionally, it should be:

/test-ubuntu-e2e-feature-main

@hardys
Copy link
Member Author

hardys commented Nov 14, 2023

/test-ubuntu-integration-main

@furkatgofurov7
Copy link
Member

Additionally, it should be:
test-ubuntu-e2e-feature-main

Looks like it did not trigger the tests at all, let's try again:

/test-ubuntu-e2e-feature-main

@furkatgofurov7
Copy link
Member

furkatgofurov7 commented Nov 15, 2023

Unrelated to these changes package update error https://jenkins.nordix.org/job/metal3_capm3_main_e2e_feature_test_ubuntu/58/execution/node/70/log/, might be just one time flakiness

/test-ubuntu-e2e-feature-main

@furkatgofurov7
Copy link
Member

Alright, it is not a flake actually, but a wider issue discussed upstream, xref: containers/podman#20690

@hardys
Copy link
Member Author

hardys commented Nov 20, 2023

Alright, it is not a flake actually, but a wider issue discussed upstream, xref: containers/podman#20690

This issue looks to be closed now so trying again

/test-ubuntu-e2e-feature-main

Copy link
Member

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me
/approve

@@ -535,7 +535,7 @@ When `spec.nodeReuse` field of metal3MachineTemplate is set to `True`, CAPM3
Machine controller:

- Sets `infrastructure.cluster.x-k8s.io/node-reuse` label to the corresponding
CAPI object name (`KubeadmControlPlane` or `MachineDeployment`) on the
CAPI object name (a `controlplane.cluster.x-k8s.io` object such as `KubeadmControlPlane` or a `MachineDeployment`) on the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. It seems we have no protection from name collisions currently. I would suggest a separate PR for this though since it would be more of a change in behavior. We will need to consider upgrade scenarios also if we change it (probably just document that CAPM3 should not be upgraded at the same time as the MD or CP is rolling out).

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 24, 2023
@lentzi90
Copy link
Member

/test-ubuntu-e2e-feature-main

@hardys
Copy link
Member Author

hardys commented Dec 6, 2023

Its a pity that the CI is still struggling with this broken pipe issue for so long. Lets try the luck one more time:

I'm not familiar with the details of that issue, is there somewhere I can track it and help collaborate on a fix?

@kashifest
Copy link
Member

Its a pity that the CI is still struggling with this broken pipe issue for so long. Lets try the luck one more time:

I'm not familiar with the details of that issue, is there somewhere I can track it and help collaborate on a fix?

HI, we have done some analysis and asked the infra provider for CI cleura to investigate since the issue is coming from infra side. The ticket is escalated there and they are now investigating. It is not currently tracked in any github issue.

@kashifest
Copy link
Member

/test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

2 similar comments
@kashifest
Copy link
Member

/test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

@furkatgofurov7
Copy link
Member

/test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

@Rozzii
Copy link
Member

Rozzii commented Dec 7, 2023

Could I ask you @hardys to please sign off the commits, we plan to enable DCO enforcement. (https://github.com/apps/dco)

It's valid to use other controlplane.cluster.x-k8s.io controlplane
provider resources, so remove the explicit check for Kind

We already check the Group e.g controlplane.cluster.x-k8s.io which
should be sufficient to ensure we're retrieving a controlplane resource
name.

Fixes: metal3-io#1310
Signed-off-by: Steven <[email protected]>
@hardys
Copy link
Member Author

hardys commented Dec 7, 2023

Could I ask you @hardys to please sign off the commits, we plan to enable DCO enforcement. (https://github.com/apps/dco)

Sure, rebased and amended 👍

@hardys
Copy link
Member Author

hardys commented Dec 8, 2023

/test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

3 similar comments
@kashifest
Copy link
Member

/test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

@kashifest
Copy link
Member

/test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

@kashifest
Copy link
Member

/test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

@kashifest
Copy link
Member

The ubuntu based feature test was very close to pass https://jenkins.nordix.org/job/metal3_capm3_main_e2e_feature_test_ubuntu/83/ , It received a broken pipe after the node reuse was performed and repivot was done and the cluster was in deleting state. That is the closest it has went, so I think this should be good to go. We can try one last time, but I will anyhow override centos test
/override test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

@metal3-io-bot
Copy link
Contributor

@kashifest: Overrode contexts on behalf of kashifest: test-centos-e2e-feature-main

In response to this:

The ubuntu based feature test was very close to pass https://jenkins.nordix.org/job/metal3_capm3_main_e2e_feature_test_ubuntu/83/ , It received a broken pipe after the node reuse was performed and repivot was done and the cluster was in deleting state. That is the closest it has went, so I think this should be good to go. We can try one last time, but I will anyhow override centos test
/override test-centos-e2e-feature-main
/test-ubuntu-e2e-feature-main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kashifest
Copy link
Member

/test-centos-e2e-integration-main
/test-ubuntu-integration-main

@kashifest
Copy link
Member

/test-centos-e2e-integration-main
/test-ubuntu-integration-main

@kashifest
Copy link
Member

kashifest commented Dec 11, 2023

In the last run node-reuse passed https://jenkins.nordix.org/job/metal3_capm3_main_e2e_feature_test_ubuntu/84/, so if we combine the last two runs, we can safely say feature test is passing.

I am not sure if I would call it a bug though, when it was implemented as a feature we assumed we would use KCP only for this feature. So, IMO we can also consider it as an extension to that feature.

/lgtm
/hold
cc @furkatgofurov7 @hardys

@metal3-io-bot metal3-io-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Dec 11, 2023
@kashifest
Copy link
Member

/override test-ubuntu-e2e-feature-main

@metal3-io-bot
Copy link
Contributor

@kashifest: Overrode contexts on behalf of kashifest: test-ubuntu-e2e-feature-main

In response to this:

/override test-ubuntu-e2e-feature-main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@furkatgofurov7
Copy link
Member

Thanks a lot @kashifest 👍🏼

I am not sure if I would call it a bug though, when it was implemented as a feature we assumed we would use KCP only for this feature. So, IMO we can also consider it as an extension to that feature.

+1, @hardys what do you think?

@hardys
Copy link
Member Author

hardys commented Dec 11, 2023

Thanks a lot @kashifest 👍🏼

I am not sure if I would call it a bug though, when it was implemented as a feature we assumed we would use KCP only for this feature. So, IMO we can also consider it as an extension to that feature.

+1, @hardys what do you think?

From my perspective an infrastructure provider shouldn't really be coupled to a specific controlplane provider, but I'm certainly fine to retitle this if the consensus is to descibe it as an enhancement 👍

@hardys
Copy link
Member Author

hardys commented Dec 12, 2023

/retitle ✨ Don't assume KubeadmControlPlane for nodeReuse

@metal3-io-bot metal3-io-bot changed the title 🐛 Don't assume KubeadmControlPlane for nodeReuse ✨ Don't assume KubeadmControlPlane for nodeReuse Dec 12, 2023
@kashifest
Copy link
Member

/hold cancel

@metal3-io-bot metal3-io-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 12, 2023
@metal3-io-bot metal3-io-bot merged commit 24e63ad into metal3-io:main Dec 12, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't assume KubeadmControlPlane is the controlplane provider
8 participants