Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(relatedresource): remove event handlers when context is closed #361

Merged
merged 3 commits into from
Mar 7, 2024

Conversation

aruiz14
Copy link
Contributor

@aruiz14 aruiz14 commented Feb 15, 2024

Fixes rancher/rancher#44090

I'm investigating a memory leak in Rancher manager, which makes the heap to constantly increase overtime (~1GB per day in the case I'm observing). By analyzing the memory profiles produces by a custom build (including this patch), I was able to determine that most of this memory was used for downstream cluster caches (of various Kubernetes kinds).

I managed to reproduce this symptom at a smaller scale locally, by repeatedly registering/unregistering a downstream cluster, then observing how the baseline for the heap increased (and returned to normal values after a restart). This allowed me to use a debugger and custom pprof profiles to narrow down which goroutines could be leaked, which pointed me to 2 different cases, both using relatedresource:

I noticed that relatedresource configures 2 different handlers, one using the higher level AddGenericHandler, which is context-aware, and a lower-level one that just implements DeleteFunc using the controller's Informer.
While I'm not familiar enough with wrangler to know if both of them are really needed, here I'm taking the chance of client-go being recently upgraded to make the latter case also be context-aware.

@aruiz14 aruiz14 force-pushed the remove-event-handlers branch from 098e010 to 5a201f0 Compare February 15, 2024 16:44
@tomleb tomleb self-requested a review February 15, 2024 20:47
Copy link
Contributor

@rmweir rmweir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@tomleb tomleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting findings. I don't understand everything yet that's going on in wrangler but I did my best to verify the fix. I created a small Go program that would register/unregister a controller for ConfigMaps with relatedresource being Secret (heh, doesn't matter). Without the fix the number of goroutines keep going up, with the fix the number of go routine stays static.

EDIT: Adding code here, might be useful, took me a while getting it to work..

main.go
package main

import (
	"context"
	"fmt"
	"os"
	goruntime "runtime"
	"time"

	"github.com/rancher/lasso/pkg/controller"
	"github.com/rancher/wrangler/v2/pkg/generated/controllers/core"
	"github.com/rancher/wrangler/v2/pkg/generic"
	"github.com/rancher/wrangler/v2/pkg/kubeconfig"
	"github.com/rancher/wrangler/v2/pkg/relatedresource"
	"github.com/rancher/wrangler/v2/pkg/schemes"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

var Scheme = runtime.NewScheme()

func init() {
	metav1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})
	utilruntime.Must(schemes.AddToScheme(Scheme))

}

func main() {
	restConfig, err := kubeconfig.GetNonInteractiveClientConfig(os.Getenv("KUBECONFIG")).ClientConfig()
	must(err)

	controllerFactory, err := controller.NewSharedControllerFactoryFromConfigWithOptions(restConfig, Scheme, nil)
	must(err)

	opts := &generic.FactoryOptions{
		SharedControllerFactory: controllerFactory,
	}

	core, err := core.NewFactoryFromConfigWithOptions(restConfig, opts)
	must(err)

	parentCtx := context.Background()

	ctx, cancel := context.WithCancel(parentCtx)
	Register(ctx, core)
	time.Sleep(5 * time.Second)
	cancel()
	time.Sleep(2 * time.Second)

	fmt.Println("SharedCacheFactory.Start")
	err = controllerFactory.SharedCacheFactory().Start(parentCtx)
	must(err)

	fmt.Println("WaitForCacheSync")
	controllerFactory.SharedCacheFactory().WaitForCacheSync(parentCtx)

	fmt.Println("ControllerFactory.Start")
	err = controllerFactory.Start(parentCtx, 10)
	must(err)

	for {
		ctx, cancel := context.WithCancel(parentCtx)
		Register(ctx, core)
		cancel()
		time.Sleep(2 * time.Second)
		fmt.Println(goruntime.NumGoroutine())
	}
}

func Register(ctx context.Context, core *core.Factory) {
	core.Core().V1().ConfigMap().OnChange(ctx, "cm", func(_ string, cm *corev1.ConfigMap) (*corev1.ConfigMap, error) {
		return cm, nil
	})
	resolve := func(namespace, name string, obj runtime.Object) ([]relatedresource.Key, error) {
		return []relatedresource.Key{
			{Namespace: "default", Name: "kube-root-ca.crt"},
		}, nil
	}
	relatedresource.Watch(
		ctx,
		"cm-secret",
		resolve,
		core.Core().V1().ConfigMap(),
		core.Core().V1().Secret())

}

func must(err error) {
	if err != nil {
		panic(err)
	}
}

@tomleb
Copy link
Contributor

tomleb commented Feb 16, 2024

I noticed that relatedresource configures 2 different handlers, one using the higher level AddGenericHandler, which is context-aware, and a lower-level one that just implements DeleteFunc using the controller's Informer. While I'm not familiar enough with wrangler to know if both of them are really needed, here I'm taking the chance of client-go being recently upgraded to make the latter case also be context-aware.

It seems like handlers added with AddGenericHandler don't run when an object is deleted, so maybe that's why this is needed (from my limited tested anyway, thuogh that suprises me so I'll have to re-verify..). There's also AddGenericRemoveHandler but that requires adding a finalizer and it's not part of the ControllerWrapper interface, so adding it there would be a breaking change.

Copy link
Contributor

@moio moio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@MbolotSuse MbolotSuse merged commit 762039f into rancher:master Mar 7, 2024
2 checks passed
aruiz14 added a commit to aruiz14/rancher that referenced this pull request Mar 7, 2024
@aruiz14 aruiz14 deleted the remove-event-handlers branch May 7, 2024 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SURE-7356] Rancher pods memory increasing over time
5 participants