-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The error type is missing when stub.Run fails #83
Comments
If it is enough to tell apart whether NRI is disabled in the runtime (an ENOENT or ECONNREFUSED, I guess) or something else, then that might be already possible using stdlib
If that is not enough, then maybe what we should do is:
So maybe something like this below, but probably with a more carefully thought through and more minimal set of errors...
|
Thansk @klihub. I believe define and export errors for classes of failures may be useful for users. But I still have one question, it's difficult to distinguish between the two situations:
In these two situations, stub.connect() will return the same error. How do I tell apart these two situations. And do you think is it necessary to distinguish? If yes, can you give me an example and probably I can add this and reconnection code into the nri sample plugin. |
I think the core of this question is
And I think the answer is that without additional externally provided information there is no way of knowing this. And in this case your only option is to keep trying to reconnect. Now (assuming that you are in full control of your cluster environment and plugin deployment), there are at least two possible extra sources of information I can think of:
That said, I think it is generally not very common that the runtime is reconfigured/restarted on a cluster worker node without draining and then restarting the whole node. However, if you have some tooling of your own which nevertheless does this then another and perhaps more proper solution would be to have that tooling provide the necessary extra information for your plugin to know whether to attempt reconnection or not. For instance, it could label the node as 'runtime being restarted' before it does it's deed then remove that label once the runtime has been fully reconfigured/updated/etc. and restarted. Or to label the node as NRI disabled, if that tooling decides to disable NRI in the runtime. Then your plugin can watch whichever related labels your tooling provides and act accordingly. |
I want to implement NRI reconnect logic in onClose() which is when the NRI plugin loses connection to the NRI server, the NRI plugin attempts to reconnect. However, during the reconnection attempt, we need to detect the possible error types of the connection and handle them differently based on the error type. However, the current error return of the NRI plugin's Start function does not distinguish between error categories. For example, whether the connection was lost due to containerd restarting or NRI being disabled on the containerd side. If NRI is disabled on the containerd side, the NRI plugin does not need to reconnect anymore. If containerd simply restarts, the NRI plugin needs to keep trying to reconnect until the connection is successful. The corresponding code in the nri repository is https://github.com/containerd/nri/blob/main/pkg/stub/stub.go#L315. I paste below.
In the Start function, we can return errors because of several reasons:
Can we give more info for us to detect error types not an error string?
The text was updated successfully, but these errors were encountered: