Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #67 DialPipe problem with multiple calls / waiting for busy pipe #80

Merged
merged 2 commits into from
Jul 19, 2018

Conversation

gdamore
Copy link
Contributor

@gdamore gdamore commented Jun 24, 2018

(I believe this change also should be used instead of PR #75 -- that PR is potentially extraordinarily buggy.)

This changes a few things to try to ensure that we never wind up with
a result of ERROR_FILE_NOT_FOUND, due to a race between closing the
last pipe instance and opening the next.

First we keep an "open" client instance (unused) while the listener
is open, so that we are guaranteed to always have an active pipe
instance. This means attempts to open while no other instances exist
result in ERROR_PIPE_BUSY instead of ERROR_FILE_NOT_FOUND.

Second we have changed the loop for dialing to eliminate a race condition
that is more or less inherent in WaitNamedPipe when synchronizing with
CreateFile. The real timeout needs to be some larger value than the
WaitNamedPipe timeout, and furthermore WaitNamedPipe is not very nice
with the Go runtime, since it is a blocking system call. Instead we
just put the goroutine to sleep for 10 milliseconds, and keep retrying
the CreateFile until the maximum timeout is reached. If no timeout is
specified we assume a reasonable and large default of 5 seconds, which is
similar to a TCP connection timeout.

This isn't perfect, as a client attempting to connect to an extremely
busy pipe server can be starved out by other clients coming in while
it is in that brief sleep, but this potential race was already present
with WaitNamedPipe. The numerous retries (by default 500 retries!)
mean its pretty unlikely to occur, and if a single client hits the
race once, it has an excellent chance of getting in the next cycle.

(A real "fix" that is completely race free and fair would require
changes in the underlying Named Pipe implementation, or some other
kind of external coordination.)

… busy pipe

This changes a few things to try to ensure that we never wind up with
a result of ERROR_FILE_NOT_FOUND, due to a race between closing the
last pipe instance and opening the next.

First we keep an "open" client instance (unused) while the listener
is open, so that we are guaranteed to always have an active pipe
instance.  This means attempts to open while no other instances exist
result in ERROR_PIPE_BUSY instead of ERROR_FILE_NOT_FOUND.

Second we have changed the loop for dialing to eliminate a race condition
that is more or less inherent in WaitNamedPipe when synchronizing with
CreateFile.  The real timeout needs to be some larger value than the
WaitNamedPipe timeout, and furthermore WaitNamedPipe is not very nice
with the Go runtime, since it is a blocking system call.  Instead we
just put the goroutine to sleep for 10 milliseconds, and keep retrying
the CreateFile until the maximum timeout is reached.  If no timeout is
specified we assume a reasonable and large default of 5 seconds, which is
similar to a TCP connection timeout.

This isn't perfect, as a client attempting to connect to an extremely
busy pipe server can be starved out by other clients coming in while
it is in that brief sleep, but this potential race was already present
with WaitNamedPipe.  The numerous retries (by default 500 retries!)
mean its pretty unlikely to occur, and if a single client hits the
race once, it has an excellent chance of getting in the next cycle.

(A real "fix" that is completely race free and fair would require
changes in the underlying Named Pipe implementation, or some other
kind of external coordination.)
@msftclas
Copy link

msftclas commented Jun 24, 2018

CLA assistant check
All CLA requirements met.

@gdamore
Copy link
Contributor Author

gdamore commented Jun 24, 2018

Hmm... I probably would have liked to have added my own copyright and update to the code... (still MIT licensed.) If its possible to do that somewhere, text more or less to the effect of:

Portions Copyright 2018 Garrett D'Amore [email protected]

(or skip the Portions word.) . If this is too much a hassle, then don't worry about it.

@gdamore
Copy link
Contributor Author

gdamore commented Jun 24, 2018

This probably also fixes #46 -- since WaitNamedPipe was the only blocking code remaining.

@carlfischer1
Copy link

cc @jstarks @johnstep

@olljanat
Copy link

FYI. I was able to reproduce original issue by running my test version of Portainer on Windows Server, version 1803 and I cannot see it anymore on version which contains content from this PR (more detail on that portainer PR).

Any change to get this one merged?

@StefanScherer
Copy link

I can confirm that this PR helps Portainer work stable with the named pipe bind mounted into a Windows container running a Docker swarm on Windows Server 1803 with Docker EE 18.03.1-ee-1

Copy link

@salah-khan salah-khan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the most robust fix given how named pipes work on Windows.

@gdamore
Copy link
Contributor Author

gdamore commented Jul 11, 2018

@jstarks @johnstep Ping? This PR addresses a very important reliability issue, and at least three different projects are waiting for this to be integrated. If there is something wrong with this, or concerns with what I've done here, then feedback would very much be appreciated.

@olljanat
Copy link

@jhowardmsft @darstahl I can see that both of you have merged stuff to this repo earlier.
Can you look this one also?

It looks that @jstarks is away.

@aiminickwong
Copy link

it's time to allow merge this pull ? @johnstep

@johnstep
Copy link
Member

I do not have permission to merge this.

@aiminickwong
Copy link

@jstarks time to allow merge this pull ?

@gdamore
Copy link
Contributor Author

gdamore commented Jul 13, 2018

Hard to believe I submitted this 19 days ago; the silence from the repo maintainers is deafening.

@jstarks
Copy link
Member

jstarks commented Jul 13, 2018

Sorry for the delays. I'm looking at this now. I want to write a quick test to understand the behavior of something before I merge this.

@jstarks
Copy link
Member

jstarks commented Jul 13, 2018

OK, I think this pull request is actually three changes:

  1. Increase the default timeout to 5 seconds.
  2. Poll instead of using WaitNamedPipe.
  3. Keep the original dummy client handle open for the lifetime of the pipe server.

I understand why increasing the timeout may be useful, but this seems like something the client can easily do without a change to go-winio. I do suspect that the default 50ms is probably too short to be useful given the problems with fairness. But I'm worried that extending it to a full 5 seconds might affect existing clients negatively. Maybe a compromise would be something like 250ms.

WaitNamedPipe has its problems. As you mention, it blocks the OS thread (which go-winio generally tries to avoid), it has various races, it does not guarantee that the client will win the race and be able to connect to the pipe, and most importantly for #67, it does not work reliably inside containers. It does have the advantage, though, that it wakes up immediately when a connection is available. I wonder if a better change wouldn't be to keep the WaitNamedPipe call, but if it fails any reason other than timeout, to sleep for 10ms and loop around again. That would probably resolve #67 and other race conditions.

This suggestion doesn't fix the issue that WaitNamedPipe blocks the OS thread, of course. If this is important to resolve, I would suggest making a change to use FSCTL_PIPE_WAIT with an asynchronous DeviceIoControl call. This fsctl is what WaitNamedPipe uses, and it is documented as part of the SMB protocol. It appears to have a non-blocking implementation.

I don't understand the third change. I can't see any behavior differences whether I keep the original client handle open or not. Unless you are aware of a behavior that I am not, I think it's better to close that client handle to avoid unnecessary resource consumption.

@olljanat
Copy link

@jstarks thanks for good comments.

I changed default timeout to 250 ms and added client handle closing now on this commit: https://github.com/olljanat/go-winio/commit/972aaec17501edc2ae66b43541f799eec50cf7c5 and did some testing with Portainer and I can tell that this combination at least still fixes #67

@gdamore I think that it is better if you comment about suggestion to still use WaitNamedPipe?

@gdamore
Copy link
Contributor Author

gdamore commented Jul 14, 2018 via email

@olljanat
Copy link

@jstarks / @gdamore so what we can do get this one fixed?

There is already three PRs #75 , this one and #84 which all are trying to solve same issue but bit different way. We should choose which one to focus and do needed changes to get is merged.

@gdamore
Copy link
Contributor Author

gdamore commented Jul 18, 2018

#75 should be closed IMO, its just dead wrong. This one may have an issue that I had not considered, but while working at the C level in another library I encountered, so it may need work.

@gdamore
Copy link
Contributor Author

gdamore commented Jul 18, 2018

Actually rereading the code (and refreshing my memory), we do open the instance with a client, meaning we keep two connections alive, so I think this is OK.

There is a really subtle possible race here, which is that if a client hits us right after we create the named pipe, and before our client can connect, we'll wind up failing the bind, as the client connection fails. This race means that a client has to race against the server doing listen.

A fix fort that, which I would be happy to follow up with, would be to check on the server side if the connected client is us, and to disconnect the remote client it if isn't us. We can detect that pretty easily by setting a flag in the listener when our client connects.

I believe that the changes here are better than what we had before, and the above refinement would just be a further improvement.

The changes in #84 are architecturally identical to what I've used in another software stack to workaround the same problems. Having said that, the code there is quite a bit more complex, harder to parse, and I really don't like that the transient error from CreateNamedPipe being deferred to the next Accept. (In my other code , I simply forcibly disconnect the client if this occurs.)

Note that #84 doesn't address all of the problems I've addressed here, specifically the use of WaitNamedPipe is problematic, and we can race and lose on the client side. The changes in #84 do address the concern of keeping the stake on the pipe, which is one of the elements that are also fixed by my changes here.

Upshot here, is that #84 is architecturally acceptable to me as a fix for one of these issues, but needs some further work IMO. It is however incomplete with respect to the full dimensions of problems.

As indicated, the changes here are also incomplete (niggling possible race) at Listen() time, but easily correctable.

I'm somewhat disinclined to invest further on this without clearer signals indicating that the work is likely to be useful and integrated -- I don't want to spend cycles on a PR that is going to just get dropped in favor of a different approach (or in favor of nothing at all, although if something doesn't get integrated I'll need to fork this for my own software.)

@gdamore
Copy link
Contributor Author

gdamore commented Jul 18, 2018

Hmm.. I have a question.

If the client handle is closed, but we haven't called DisconnectNamedPipe nor closed the server handle, is the server pipe instance still retained and busy (so no new client handles will connect to it, and so that the we won't get ERROR_NOT_FOUND in CreateFile on the client?)

If so, then closing the file handle would be perfectly reasonable. In such a case, there would be no need to retain the client handle. A better comment explaining this would be helpful. There is still that race condition I mentioned, where some other clients connects before we do. That would be unfortunate, and again would be easily fixed in a follow up.

@jstarks
Copy link
Member

jstarks commented Jul 18, 2018

What I've observed (and I think we have tests to confirm) is that keeping the server handle open is sufficient to retain ownership of the pipe and to ensure that clients get ERROR_PIPE_BUSY. So keeping the client handle open is not necessary.

Edit: agreed that a better comment would be useful here.

@jstarks
Copy link
Member

jstarks commented Jul 18, 2018

I'm inclined to take this change with the following tweaks:

  • Go back to closing the client handle.
  • Reduce the timeout to 2 seconds.

In the future I would like to reintroduce the WaitNamedPipe behavior using the FSCTL to avoid blocking the thread. But I don't think we need to hold this change for that now.

Agreed that there is an existing race in Listen() that this change doesn't fix. We can defer that for another change.

@gdamore
Copy link
Contributor Author

gdamore commented Jul 18, 2018

This sounds great! If you will integrate this change, then I will follow up with a PR to fix that Listen() race this week.

@gdamore
Copy link
Contributor Author

gdamore commented Jul 18, 2018

Do you want me to follow up with a modification to this PR for the above alterations (2 second timeout, close client handle) or address at your end?

@jstarks
Copy link
Member

jstarks commented Jul 18, 2018

If you have the time, I'd appreciate it.

@gdamore
Copy link
Contributor Author

gdamore commented Jul 18, 2018

Ok, coming shortly.

We can safely close the client handle, because the server side
is not closed, and thsi keeps our hold on the pipe instance.
@gdamore
Copy link
Contributor Author

gdamore commented Jul 18, 2018

Done. Would be good to test it before integrating. :-)

@jstarks
Copy link
Member

jstarks commented Jul 18, 2018

Thanks! I'll take a look a little later today and merge it. If @olljanat has time to validate it in his workload, that would be useful.

@olljanat
Copy link

@jstarks my workload looks working very nicely with this one.
@gdamore good work 👍

@jstarks jstarks merged commit a6d595a into microsoft:master Jul 19, 2018
@jstarks
Copy link
Member

jstarks commented Jul 19, 2018

Thanks @gdamore for the fix, @olljanat for verification, and everyone for their patience on this.

groob added a commit to groob/go-winio that referenced this pull request Jul 24, 2018
The timeout value was changed from 5 to 2 seconds in microsoft#80
@gdamore gdamore deleted the norace2 branch October 31, 2019 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants