Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix nilpointer panic when decode addlitional item failed in restore #8563

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

AlbeeSo
Copy link
Contributor

@AlbeeSo AlbeeSo commented Dec 30, 2024

Thank you for contributing to Velero!

Please add a summary of your change

When decode additional item in restoreItem failed,

errs.Add(namespace, errors.Wrapf(err, "error restoring additional item %s", additionalResourceID))

may case nilpointer panic in the inner function (like GetName, for exmple, as the obj is nil actually)
w, e, additionalItemExists := ctx.restoreItem(additionalObj, additionalItem.GroupResource, additionalItemNamespace)

with messages like (I tested in velero 1.14, but the related logic is same with the latest):

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x241f664]

goroutine 350 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
  /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x281c160?, 0x4725ed0?})
  /usr/local/go/src/runtime/panic.go:770 +0x132
k8s.io/apimachinery/pkg/apis/meta/v1/unstructured.(*Unstructured).GetName(...)
  /go/pkg/mod/k8s.io/[email protected]/pkg/apis/meta/v1/unstructured/unstructured.go:251
github.com/vmware-tanzu/velero/pkg/restore.(*restoreContext).restoreItem(0xc0006a66c8, 0x0, {{0x0, 0x0}, {0xc0017a7ec0, 0x11}}, {0x0, 0x0})
  /go/src/github.com/vmware-tanzu/velero/pkg/restore/restore.go:1087 +0x1e4
github.com/vmware-tanzu/velero/pkg/restore.(*restoreContext).restoreItem(0xc0006a66c8, 0xc0004b1a28, {{0x0, 0x0}, {0xc0004e4120, 0x16}}, {0xc000af003c, 0x6})
  /go/src/github.com/vmware-tanzu/velero/pkg/restore/restore.go:1394 +0xb026
github.com/vmware-tanzu/velero/pkg/restore.(*restoreContext).processSelectedResource(0xc0006a66c8, {{0xc0004e4120?, 0xc0007cc7d0?}, 0xc001c8f4d0?, 0xc000c0e780?}, 0x377, 0x5c, 0xc001a96708, 0xc001182120)
  /go/src/github.com/vmware-tanzu/velero/pkg/restore/restore.go:757 +0xb34
github.com/vmware-tanzu/velero/pkg/restore.(*restoreContext).execute(0xc0006a66c8)
  /go/src/github.com/vmware-tanzu/velero/pkg/restore/restore.go:609 +0x31ca
github.com/vmware-tanzu/velero/pkg/restore.(*kubernetesRestorer).RestoreWithResolvers(0xc000685d00, 0xc000501340, {{0xc000ab4000?, 0xc00097b1c8?, 0xddeb5c?}}, {0x7f7be4018ff0, 0xc0002d2820})
  /go/src/github.com/vmware-tanzu/velero/pkg/restore/restore.go:326 +0x13d1
github.com/vmware-tanzu/velero/pkg/controller.(*restoreReconciler).runValidatedRestore(0xc0007c0240, 0xc00012d608, {0xc0007c6a88?, 0xc000acd6c0?}, 0x0)
  /go/src/github.com/vmware-tanzu/velero/pkg/controller/restore_controller.go:565 +0x1273
github.com/vmware-tanzu/velero/pkg/controller.(*restoreReconciler).Reconcile(0xc0007c0240, {0x3144e00, 0xc00095e900}, {{{0xc000968cc8?, 0x0?}, {0xc000058f60?, 0xc00095e900?}}})
  /go/src/github.com/vmware-tanzu/velero/pkg/controller/restore_controller.go:264 +0xb45
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x314cfb8?, {0x3144e00?, 0xc00095e900?}, {{{0xc000968cc8?, 0xb?}, {0xc000058f60?, 0x0?}}})
  /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005f6780, {0x3144e38, 0xc00033ce10}, {0x297d740, 0xc0004766a0})
  /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316 +0x3bc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005f6780, {0x3144e38, 0xc00033ce10})
  /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x1be
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
  /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 248
  /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x50c

so I guess the right way is to give up restoring and just throw errors.

Does your change fix a particular issue?

Fixes #(issue)

Please indicate you've done the following:

@AlbeeSo AlbeeSo force-pushed the fix/invalid-restore-addtional-item branch from 3060b6d to b9525af Compare December 30, 2024 04:04
Copy link

codecov bot commented Dec 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.18%. Comparing base (78c97d9) to head (24860e4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8563      +/-   ##
==========================================
+ Coverage   59.17%   59.18%   +0.01%     
==========================================
  Files         370      370              
  Lines       39487    39493       +6     
==========================================
+ Hits        23366    23375       +9     
+ Misses      14647    14645       -2     
+ Partials     1474     1473       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@AlbeeSo
Copy link
Contributor Author

AlbeeSo commented Dec 30, 2024

/kind changelog-not-required

@github-actions github-actions bot added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Dec 30, 2024
@blackpiglet blackpiglet added kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes and removed kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes labels Dec 30, 2024
Signed-off-by: 司司 <[email protected]>
@anshulahuja98
Copy link
Collaborator

what is the scenario where this was actually hit?
Shouldn't we fix the rootcause?
Or was it a internal plugin for some user

@AlbeeSo
Copy link
Contributor Author

AlbeeSo commented Dec 31, 2024

what is the scenario where this was actually hit? Shouldn't we fix the rootcause? Or was it a internal plugin for some user

hi, sorry that i am still trying reproduce the problem as the testing k8s cluster has been deleted.
but i think this PR is common for such cases. i will submit another PR if needed if i find something.

@anshulahuja98
Copy link
Collaborator

I am not sure if it is right to continue the restore by just putting this in error.
It should ideally be a hard failure.

@AlbeeSo
Copy link
Contributor Author

AlbeeSo commented Jan 2, 2025

I am not sure if it is right to continue the restore by just putting this in error. It should ideally be a hard failure.

A hard failure is another choice, instead of the panic currently.
However I still think putting error is enough, to keep consistent with the normal items
https://github.com/vmware-tanzu/velero/blob/03d0bd9d22940f56a2fd30bd1f66c17a9b8a6f30/pkg/restore/restore.go#L750C1-L761C5

@anshulahuja98
Copy link
Collaborator

If we are hitting this issue, it means that it is not a per item failure, but a code bug potentially. Either in native or 3P plugins.
I understand panic might not be ideal, but I don't think we should treat this as an item level failure. My suggestion would be for hard failure.
The issue with marking it PartiallyFailed is that it is hard to catch code bugs. People / Users might ignore it.

Hard failures can be caught in E2E tests much easily.

@anshulahuja98
Copy link
Collaborator

@blackpiglet do you think my suggestion is aligned here? Or do you think current PR is better approach?

@blackpiglet
Copy link
Contributor

I didn't give much think to this error-handling logic before.
IMO, this error should not be related to the plugins. The plugins return the additional item IDs, but the plugins don't modify the item content.
The only chance the error happens is the JSON content gets corrupted during the network transferring.

In this case, I'm still not sure which one is a better choice, skip the failed item or fail the restore.
@anshulahuja98

@anshulahuja98
Copy link
Collaborator

"is the JSON content gets corrupted during the network transferring." - that seems like a very particular scenario. Not sure if that's what happened in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has-unit-tests kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants