fix(relatedresource): remove event handlers when context is closed #361
Conversation
098e010 to 5a201f0
LGTM
Interesting findings. I don't understand everything that's going on in wrangler yet, but I did my best to verify the fix. I created a small Go program that registers/unregisters a controller for ConfigMaps, with the relatedresource being Secret (heh, doesn't matter). Without the fix the number of goroutines keeps going up; with the fix the number of goroutines stays constant.
EDIT: Adding the code here, it might be useful; it took me a while to get it working.
main.go
package main

import (
    "context"
    "fmt"
    "os"
    goruntime "runtime"
    "time"

    "github.com/rancher/lasso/pkg/controller"
    "github.com/rancher/wrangler/v2/pkg/generated/controllers/core"
    "github.com/rancher/wrangler/v2/pkg/generic"
    "github.com/rancher/wrangler/v2/pkg/kubeconfig"
    "github.com/rancher/wrangler/v2/pkg/relatedresource"
    "github.com/rancher/wrangler/v2/pkg/schemes"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/runtime/schema"
    utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

var Scheme = runtime.NewScheme()

func init() {
    metav1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})
    utilruntime.Must(schemes.AddToScheme(Scheme))
}

func main() {
    restConfig, err := kubeconfig.GetNonInteractiveClientConfig(os.Getenv("KUBECONFIG")).ClientConfig()
    must(err)

    controllerFactory, err := controller.NewSharedControllerFactoryFromConfigWithOptions(restConfig, Scheme, nil)
    must(err)

    opts := &generic.FactoryOptions{
        SharedControllerFactory: controllerFactory,
    }
    core, err := core.NewFactoryFromConfigWithOptions(restConfig, opts)
    must(err)

    parentCtx := context.Background()

    // Register once with a short-lived context so the caches and controllers
    // exist, then cancel before starting the factories.
    ctx, cancel := context.WithCancel(parentCtx)
    Register(ctx, core)
    time.Sleep(5 * time.Second)
    cancel()
    time.Sleep(2 * time.Second)

    fmt.Println("SharedCacheFactory.Start")
    err = controllerFactory.SharedCacheFactory().Start(parentCtx)
    must(err)

    fmt.Println("WaitForCacheSync")
    controllerFactory.SharedCacheFactory().WaitForCacheSync(parentCtx)

    fmt.Println("ControllerFactory.Start")
    err = controllerFactory.Start(parentCtx, 10)
    must(err)

    // Repeatedly register and immediately unregister the handlers, printing the
    // goroutine count: it keeps growing without the fix and stays flat with it.
    for {
        ctx, cancel := context.WithCancel(parentCtx)
        Register(ctx, core)
        cancel()
        time.Sleep(2 * time.Second)
        fmt.Println(goruntime.NumGoroutine())
    }
}

func Register(ctx context.Context, core *core.Factory) {
    core.Core().V1().ConfigMap().OnChange(ctx, "cm", func(_ string, cm *corev1.ConfigMap) (*corev1.ConfigMap, error) {
        return cm, nil
    })

    resolve := func(namespace, name string, obj runtime.Object) ([]relatedresource.Key, error) {
        return []relatedresource.Key{
            {Namespace: "default", Name: "kube-root-ca.crt"},
        }, nil
    }

    relatedresource.Watch(
        ctx,
        "cm-secret",
        resolve,
        core.Core().V1().ConfigMap(),
        core.Core().V1().Secret())
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}
It seems like handlers added with |
includes rancher/wrangler#361 Signed-off-by: Silvio Moioli <[email protected]>
lgtm
In order to include rancher/wrangler#361
Fixes rancher/rancher#44090
I'm investigating a memory leak in Rancher manager, which causes the heap to constantly increase over time (~1GB per day in the case I'm observing). By analyzing the memory profiles produced by a custom build (including this patch), I was able to determine that most of this memory was used for downstream cluster caches (of various Kubernetes kinds).
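For reference, exposing such profiles from a custom build only takes the standard library's net/http/pprof package; this is a generic sketch, and the port and setup are illustrative rather than taken from Rancher's code:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Heap profile:      go tool pprof http://localhost:6060/debug/pprof/heap
    // Goroutine profile: go tool pprof http://localhost:6060/debug/pprof/goroutine
    log.Println(http.ListenAndServe("localhost:6060", nil))
}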
I managed to reproduce this symptom at a smaller scale locally by repeatedly registering/unregistering a downstream cluster, then observing how the baseline for the heap increased (and returned to normal values after a restart). This allowed me to use a debugger and custom pprof profiles to narrow down which goroutines could be leaked, which pointed me to 2 different cases, both using relatedresource:

1. Despite the Register function being called per-cluster, this line registers the handler in the management context (which points to the upstream cluster), which causes the same handler to be registered many times; that registration only needs to happen once, hence the sync.Once (see the sketch after this list).
2. relatedresource.Watch will use an informer for the management context to enqueue the ConfigMaps controller in the user context, with multiple consequences: the registered handler keeps that controller referenced (via Enqueue), and it's a shared controller, which also holds a reference to a SharedCacheFactory, hence leaking all the downstream caches.
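As a minimal sketch of the sync.Once workaround for the first case (the variable and function names here are illustrative, not Rancher's actual code):

package example

import (
    "context"
    "sync"
)

// registerOnce guards the part of registration that targets the management
// (upstream) context, so calling Register once per downstream cluster does not
// add the same handler over and over.
var registerOnce sync.Once

func Register(ctx context.Context) {
    registerOnce.Do(func() {
        // register the management-context handler exactly once per process
    })
    // per-cluster registrations can still happen on every call
}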
I noticed that relatedresource configures 2 different handlers: one using the higher-level AddGenericHandler, which is context-aware, and a lower-level one that just implements DeleteFunc using the controller's Informer. While I'm not familiar enough with wrangler to know whether both of them are really needed, here I'm taking the chance of client-go having recently been upgraded to make the latter case context-aware as well.
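To illustrate what context-aware means for that lower-level handler, here is a rough sketch using the handler-removal API that recent client-go versions expose on shared informers (AddEventHandler returns a registration handle, RemoveEventHandler takes it back); the watchDeletes helper is hypothetical and not wrangler's actual implementation:

package example

import (
    "context"
    "log"

    "k8s.io/client-go/tools/cache"
)

// watchDeletes registers a DeleteFunc on a shared informer and removes it again
// once ctx is closed, so the informer stops holding a reference to the handler
// (and to whatever the handler itself references).
func watchDeletes(ctx context.Context, informer cache.SharedIndexInformer, onDelete func(obj interface{})) error {
    registration, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        DeleteFunc: onDelete,
    })
    if err != nil {
        return err
    }
    go func() {
        <-ctx.Done()
        if err := informer.RemoveEventHandler(registration); err != nil {
            log.Printf("removing delete handler: %v", err)
        }
    }()
    return nil
}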