fix(relatedresource): remove event handlers when context is closed #361

aruiz14 · 2024-02-15T16:21:37Z

I'm investigating a memory leak in Rancher manager that causes the heap to grow constantly over time (~1GB per day in the case I'm observing). By analyzing the memory profiles produced by a custom build (including this patch), I was able to determine that most of this memory was used for downstream cluster caches (of various Kubernetes kinds).
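For reference, a minimal sketch of how heap and goroutine profiles can be captured in-process with the standard runtime/pprof package. The helper name snapshotProfile is made up for illustration; this is not necessarily how the custom build mentioned above collected its profiles.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// snapshotProfile (hypothetical helper) returns the named runtime profile
// ("heap", "goroutine", ...) in pprof's binary format.
func snapshotProfile(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	if name == "heap" {
		// Heap profiles reflect the last completed GC, so force one first.
		runtime.GC()
	}
	var buf bytes.Buffer
	// debug=0 selects the compressed protobuf format read by `go tool pprof`.
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	for _, name := range []string{"heap", "goroutine"} {
		data, err := snapshotProfile(name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s profile: %d bytes\n", name, len(data))
	}
}
```

Writing two such snapshots to files some time apart and comparing them should show which allocations or goroutines keep accumulating.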

I managed to reproduce this symptom at a smaller scale locally, by repeatedly registering/unregistering a downstream cluster, then observing how the baseline for the heap increased (and returned to normal values after a restart). This allowed me to use a debugger and custom pprof profiles to narrow down which goroutines could be leaked, which pointed me to 2 different cases, both using relatedresource:

https://github.com/rancher/rancher/blob/v2.7.9/pkg/controllers/managementuser/rbac/handler_base.go#L136-L137
- Although the Register function is called per cluster, this line registers the handler in the management context (which points to the upstream cluster), so the same handler ends up registered many times.
- Despite being inefficient, this is not the culprit for the memory leak. I'll open a separate issue in https://github.com/rancher/rancher to fix it, either by moving it somewhere else or by wrapping it with sync.Once.
https://github.com/rancher/rancher/blob/v2.7.9/pkg/controllers/managementuser/snapshotbackpopulate/snapshotbackpopulate.go#L71-L79
This usage of relatedresource.Watch will use an informer for the management context to enqueue the ConfigMaps controller in the user context, with multiple consequences:
- The management context will "outlive" the user context, whether the cluster is removed or reassigned to a different replica (in clustered setups), but this handler is never removed.
- The version of client-go used in Rancher 2.7.9 (0.25.4) does not support removing handlers from informers. However, this is supported starting with 0.26.0.
- The listener holds a reference to the user context controller (used to Enqueue). Since this is a shared controller, which in turn holds a reference to a SharedCacheFactory, the handler ends up leaking all the downstream caches.

I noticed that relatedresource configures 2 different handlers, one using the higher level AddGenericHandler, which is context-aware, and a lower-level one that just implements DeleteFunc using the controller's Informer.
While I'm not familiar enough with wrangler to know if both of them are really needed, here I'm taking the chance of client-go being recently upgraded to make the latter case also be context-aware.

rmweir

LGTM

tomleb

Interesting findings. I don't yet understand everything that's going on in wrangler, but I did my best to verify the fix. I created a small Go program that registers/unregisters a controller for ConfigMaps, with Secret as the related resource (heh, doesn't matter). Without the fix, the number of goroutines keeps going up; with the fix, it stays constant.

EDIT: Adding the code here since it might be useful; it took me a while to get it working.

main.go

package main

import (
	"context"
	"fmt"
	"os"
	goruntime "runtime"
	"time"

	"github.com/rancher/lasso/pkg/controller"
	"github.com/rancher/wrangler/v2/pkg/generated/controllers/core"
	"github.com/rancher/wrangler/v2/pkg/generic"
	"github.com/rancher/wrangler/v2/pkg/kubeconfig"
	"github.com/rancher/wrangler/v2/pkg/relatedresource"
	"github.com/rancher/wrangler/v2/pkg/schemes"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

var Scheme = runtime.NewScheme()

func init() {
	metav1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})
	utilruntime.Must(schemes.AddToScheme(Scheme))

}

func main() {
	restConfig, err := kubeconfig.GetNonInteractiveClientConfig(os.Getenv("KUBECONFIG")).ClientConfig()
	must(err)

	controllerFactory, err := controller.NewSharedControllerFactoryFromConfigWithOptions(restConfig, Scheme, nil)
	must(err)

	opts := &generic.FactoryOptions{
		SharedControllerFactory: controllerFactory,
	}

	core, err := core.NewFactoryFromConfigWithOptions(restConfig, opts)
	must(err)

	parentCtx := context.Background()

	ctx, cancel := context.WithCancel(parentCtx)
	Register(ctx, core)
	time.Sleep(5 * time.Second)
	cancel()
	time.Sleep(2 * time.Second)

	fmt.Println("SharedCacheFactory.Start")
	err = controllerFactory.SharedCacheFactory().Start(parentCtx)
	must(err)

	fmt.Println("WaitForCacheSync")
	controllerFactory.SharedCacheFactory().WaitForCacheSync(parentCtx)

	fmt.Println("ControllerFactory.Start")
	err = controllerFactory.Start(parentCtx, 10)
	must(err)

	for {
		ctx, cancel := context.WithCancel(parentCtx)
		Register(ctx, core)
		cancel()
		time.Sleep(2 * time.Second)
		fmt.Println(goruntime.NumGoroutine())
	}
}

func Register(ctx context.Context, core *core.Factory) {
	core.Core().V1().ConfigMap().OnChange(ctx, "cm", func(_ string, cm *corev1.ConfigMap) (*corev1.ConfigMap, error) {
		return cm, nil
	})
	resolve := func(namespace, name string, obj runtime.Object) ([]relatedresource.Key, error) {
		return []relatedresource.Key{
			{Namespace: "default", Name: "kube-root-ca.crt"},
		}, nil
	}
	relatedresource.Watch(
		ctx,
		"cm-secret",
		resolve,
		core.Core().V1().ConfigMap(),
		core.Core().V1().Secret())

}

func must(err error) {
	if err != nil {
		panic(err)
	}
}

tomleb · 2024-02-16T18:07:02Z

I noticed that relatedresource configures 2 different handlers, one using the higher level AddGenericHandler, which is context-aware, and a lower-level one that just implements DeleteFunc using the controller's Informer. While I'm not familiar enough with wrangler to know if both of them are really needed, here I'm taking the chance of client-go being recently upgraded to make the latter case also be context-aware.

It seems like handlers added with AddGenericHandler don't run when an object is deleted, so maybe that's why this is needed (from my limited testing anyway, though that surprises me, so I'll have to re-verify). There's also AddGenericRemoveHandler, but that requires adding a finalizer and it's not part of the ControllerWrapper interface, so adding it there would be a breaking change.

moio

lgtm

fix(relatedresource): remove event handler when context is closed

afd01a1

aruiz14 requested review from moio, rmweir and MbolotSuse February 15, 2024 16:21

fix: go 1.20 compatibility

5a201f0

aruiz14 force-pushed the remove-event-handlers branch from 098e010 to 5a201f0 Compare February 15, 2024 16:44

tomleb self-requested a review February 15, 2024 20:47

rmweir approved these changes Feb 15, 2024

tomleb approved these changes Feb 16, 2024

Merge branch 'master' into remove-event-handlers

06dfe87

moio added a commit to rancher/rancher that referenced this pull request Mar 4, 2024

wrangler: use patched version

7c91f79

includes rancher/wrangler#361 Signed-off-by: Silvio Moioli <[email protected]>

This was referenced Mar 4, 2024

[v2.0] fix(relatedresource): remove event handlers when context is closed #362

Merged

[v2.1] fix(relatedresource): remove event handlers when context is closed #363

Merged

moio approved these changes Mar 7, 2024

MbolotSuse merged commit 762039f into rancher:master Mar 7, 2024
2 checks passed

aruiz14 added a commit to aruiz14/rancher that referenced this pull request Mar 7, 2024

deps(wrangler): bump wrangler to v2.1.4

a79e8d8

In order to include rancher/wrangler#361

aruiz14 deleted the remove-event-handlers branch May 7, 2024 08:50
