-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
argon2: Add helpers for running the KDF remotely #328
argon2: Add helpers for running the KDF remotely #328
Conversation
As Argon2 is memory intensive, it's not suitable for multiple invocations in long-lived garbage collected processes. For this reason, Argon2 is abstracted with an interface (Argon2KDF), of which the application sets a global version of this which is intended to proxy KDF requests to a short-lived remote process which uses the real InProcessArgon2KDF. This adds some functionality to facilitate this. First of all, InProcessArgon2KDF is no longer a variable - it's a function. By default, it's methods return an error unless the application code has called SetIsArgon2RemoteProcess, which unlocks the real in-process KDF. Then there are JSON serializable types "Argon2RemoteInput" and "Argon2RemoteOutput". The input can be fed directly to RunArgon2RequestInRemoteProcess on the remote side, but this is a fairly low-level API. There is a higher level API - NewRemoteArgon2KDF, for use in the application process, and which returns an implementation of Argon2KDF which proxies requests to a short-lived remote helper process. The caller supplied a function to construct an appropriate exec.Cmd instance for this. This function is configured so that the remote process recieves a request on stdin and it expects a response on stdout. The remote process passes both os.Stdin and os.Stdout to WaitAndRunArgon2RequestInRemoteProcess, although it doesn't hardcode these descriptors for implementations that want to construct their own transport that doesn't rely on stdin and stdout. Once a remote process has completed a request, it should exit cleanly. Neither RunArgon2RequestInRemoteProcess or WaitAndRunArgon2RequestInRemoteProcess support being called more than once in the same process. The code in cmd/run_argon2 provides an example remote process, although this is mainly useful for unit testing (where it is currently used). It is envisaged that the remote process will be a special mode of snapd and snap-bootstrap in order to avoid adding an additional new go binary just for this.
4cfd159
to
017287b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a first pass
argon2_remote_support.go
Outdated
"github.com/snapcore/secboot/internal/argon2" | ||
) | ||
|
||
type nullInProcessArgon2KDFImpl struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null is a bit of a strange name for something that actually fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type has gone, although there is still a nullArgon2KDFImpl
that existed previously.
argon2_remote_support.go
Outdated
defer func() { | ||
// Run Cmd.Wait in a defer so that we shut down on error paths too, | ||
// and we capture the Wait error if there was no other error. | ||
waitErr := cmd.Wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need a timeout mechanism around this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored this quite a bit now. Not only is there a timeout mechanism when waiting for the remote process to exit (and go 1.20 introduces a nicer way to do this), there is also an optional watchdog mechanism that exercises the command processing / response sending code on the remote side, so that the parent knows that it's eventually going to get a response and can terminate the operation if this fails.
argon2_remote_support.go
Outdated
// os.Stdout when using - certainly when using [NewRemoteArgon2KDF] on the parent side. | ||
// This function can only be called once in a process. Subsequent calls in the same | ||
// process will result in an error response being returned. | ||
func WaitAndRunArgon2RequestInRemoteProcess(in io.Reader, out io.Writer) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function code does not seem to be exercised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets executed in a sub-process (cmd/run_argon2
) by any test that makes use of NewOutOfProcessArgon2KDF
, which might be why it doesn't appear in the test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added an in-process test of this function so that it appears in the test coverage.
export_test.go
Outdated
@@ -148,6 +150,11 @@ func MockHashAlgAvailable() (restore func()) { | |||
} | |||
} | |||
|
|||
func ClearIsArgon2RemoteProcess() { | |||
atomic.StoreUint32(&argon2RemoteProcessStatus, notArgon2RemoteProcess) | |||
runtime.GC() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the GC here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment to this function to explain why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this function (and the corresponding SetIsArgon2RemoteProcess
entirely now.
argon2_remote_support.go
Outdated
} | ||
} | ||
|
||
switch input.Command { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a bit more of the error cases below should be tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some more test cases.
@chrisccoulson Also OutOfProcess might be a clearer term than Remote, if a bit longer |
I've renamed things to |
This includes some other small refactorigns, eg, the handler process makes use of the global Argon2KDF implementation now in the same way that the parent process does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the changes, maybe I missed it but it doesn't seem you replied to my question about what to do if the subprocess takes way too long/gets stuck for some reason
Aha, I'll think of a way to address that. As the process is meant to block for some time, I wonder whether creating some sort of regular watchdog or ping would be more appropriate here. It shouldn't be that hard to add to the protocol. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a pass, there is quite a few moving parts here, probably a good idea for @valentindavid @bboozzoo and @ZeyadYasser to take a look for awareness and possibly review as well
argon2.go
Outdated
@@ -119,14 +130,15 @@ func (o *Argon2Options) kdfParams(keyLen uint32) (*kdfParams, error) { | |||
|
|||
mode := o.Mode | |||
if mode == Argon2Default { | |||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgot the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did. I've added a comment now.
// file we have opened. Try again with a 10ms backoff time, | ||
// unless the request timeout fires first. | ||
select { | ||
case <-time.NewTimer(10 * time.Millisecond).C: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this a bit too aggressive? how long does it take usually to compute the actual KDF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, we have no idea how long it will be before we can acquire the lock, so it's a bit of a trade-off between being aggressive and waiting longer than is necessary, where the lock ends up unheld for longer for a period. I'm not sure what time to use for the backoff here tbh - the other option is to make the call to flock
blocking, but then there's no way for it to be able to timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this to 100ms.
select { | ||
case <-timeoutTimer.C: | ||
// The timeout has expired. | ||
return nil, errArgon2OutOfProcessHandlerSystemLockTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this case is not reached by tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit hit and miss whether the timeout on line 154 gets tested, or this one gets tested - it depends where the code lands before the timer expires. I'm not sure of a better way of doing it but I'll try and think of a way to make it a bit more reproducible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored this code to make it more reproducible where the code lands during tests.
argon2_out_of_process_support.go
Outdated
var startupWg sync.WaitGroup | ||
startupWg.Add(1) // Mark the WaitGroup as waiting for a single event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to use a channel for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind, I just picked sync.WaitGroup
because it seems like it's designed for this use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this to a channel now.
These tests keep passing locally and failing in github, and memory usage is the only thing I can think of that might be causing the failures.
Also abort WaitForAndRunArgon2OutOfProcessRequest on an error response, as the underlying RunArgon2OutOfProcessRequest function can't be called again anyway.
…sts more reproducible
argon2.go
Outdated
benchmarkParams := &argon2.BenchmarkParams{ | ||
MaxMemoryCostKiB: 1 * 1024 * 1024, // the default maximum memory cost is 1GiB. | ||
TargetDuration: 2 * time.Second, // the default target duration is 2s. | ||
} | ||
|
||
if o.MemoryKiB != 0 { | ||
// The memory cost has been spcified expli |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The memory cost has been spcified expli | |
// The memory cost has been specified explicitly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed this.
argon2_out_of_process_support.go
Outdated
str := "cannot process request: " + string(e.ErrorType) | ||
if e.ErrorString != "" { | ||
str += " (" | ||
str += e.ErrorString | ||
str += ")" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use strings.Builder(), it's available from 1.10+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using that now.
return actualRsp, nil | ||
} | ||
|
||
func (k *outOfProcessArgon2KDFImpl) Derive(passphrase string, salt []byte, mode Argon2Mode, params *Argon2CostParams, keyLen uint32) (key []byte, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's part of the existing interface, but if params is always expected to be non nil, this could pass value not a pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable, but given that this interface has been around for a while, shall we leave this one for another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yet, we can do a followup, or leave a TODO
argon2_out_of_process_support.go
Outdated
// eventually returning an error to the caller. | ||
var req *Argon2OutOfProcessRequest | ||
dec := json.NewDecoder(in) | ||
dec.DisallowUnknownFields() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
means that the both ends need to be in sync now, which hopefully is true most of the time, but should there be some format
, version
field in requests/responses to indicate protocol version?
Just to be clear, I would not want to introduce any form to a handshake, as this would complicate the code even more, but a clear version check would help with debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the calls to DisallowUnknownFields
everywhere it was used.
argon2_out_of_process_support.go
Outdated
startupCh := make(chan struct{}) | ||
|
||
tmb.Go(func() error { | ||
startupCh <- struct{}{} // Unblock the main routine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should have the same effect
startupCh <- struct{}{} // Unblock the main routine. | |
close(startupCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this change now.
This updates the x/crypto repository to the latest version we possibly can whilst still maintaining compatibility with go 1.18.
This ended up catching a bug that resulted in a deadlock if the locking function had to perform a retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, I need to do another pass with a clear head
internal/paths/paths.go
Outdated
// Argon2OutOfProcessHandlerSystemLockPath is the lock file path used to | ||
// serialize KDF requests system-wide. All process's that use the system-wide | ||
// lock participate in the lock/unlock contract described above. | ||
Argon2OutOfProcessHandlerSystemLockPath = filepath.Join(RunDir, "snapd/argon2.lock") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if secbot was consumed by other project than snapd, how could they set a different path given they would not be able to import secboot's internal package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, the path already includes a directory which is not guaranteed to exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per discussion with @chrisccoulson we'll use a snapd agnostic path, or consider making the lock file location an argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this to /run/secboot_argon2.lock
// Note that Argon2 requests are serialized using a system-wide lock, which this function does not | ||
// explicitly release. If the lock is acquired, it returns a callback that the caller may choose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if I were to run 2 out of process instances, requests to each instance would be serialized on the same shared lock. Why not a per instance lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per discussion with @chrisccoulson Conceptually follows the same approach as cryptsetup, trying to keep the lock as close to the process as possible.
|
||
costParams := &Argon2CostParams{ | ||
Time: request.Time, | ||
MemoryKiB: request.MemoryKiB, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can memory be 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per discussion with @chrisccoulson 0 is valid, the code will use a minimum of 32kB.
@chrisccoulson Can you add a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment explaining why we don't test MemoryKiB
here.
// future, which might require replacing the existing library we use for Argon2. | ||
func RunArgon2OutOfProcessRequest(request *Argon2OutOfProcessRequest) (response *Argon2OutOfProcessResponse, lockRelease func()) { | ||
// Perform checks of arguments that are common to call requests | ||
switch request.Mode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to move the request validation to separate helper, eg:
func validateOutOfProcessRequest(req *Argon2OutOfProcessRequest ) error {
..
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just cosmetics, not high priority
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might look at this in a follow-up
argon2_out_of_process_support.go
Outdated
// | ||
// The implementation is expected to be paired with an equivalent implementation of | ||
// [Argon2OutOfProcessWatchdogMonitor] in the parent process. | ||
type Argon2OutOfProcessWatchdogHandler = func(challenge []byte) (response []byte, err error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless this needs to be a type alias?
type Argon2OutOfProcessWatchdogHandler = func(challenge []byte) (response []byte, err error) | |
type Argon2OutOfProcessWatchdogHandler func(challenge []byte) (response []byte, err error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are no longer type aliases
// Note that Argon2 requests are serialized using a system-wide lock, which this function does not | ||
// explicitly release. If the lock is acquired, it returns a callback that the caller may choose | ||
// to execute in order to explicitly release the lock, or the caller can just leave it to be | ||
// implicitly released on process exit. If the lock is explicitly released, the caller must be | ||
// sure that the large amount of memory allocated for the Argon2 operation has been freed and | ||
// returned back to the OS, else this defeats the point of having a system-wide lock (to avoid | ||
// having multiple processes with high physical memory requirements running at the same time). If | ||
// the lock wasn't acquired, no release callback will be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the locking exercise was left to the caller in the parent process? it feels like this is imposing a significant limitation but without visibility by the users of secboot.
} | ||
|
||
// Attempt to open the lock file for writing. | ||
lockFile, err = os.OpenFile(paths.Argon2OutOfProcessHandlerSystemLockPath, os.O_RDWR|os.O_CREATE, 0600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| syscall.O_NOFOLLOW | syscall.O_CLOEXEC ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CLOEXEC may be set by default, let's double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added O_NOFOLLOW now, and verified that O_CLOEXEC is set b the go runtime.
The gopkg.in/tomb.v2 package does not handle routines that panic very well - the active routine count will not decrement and it won't put the tomb into a dying state. In fact, it looks like it makes it impossible for the tomb to ever reach a dead state. If the Argon2 crypto package panics, the KDF routine will disappear but the child process will carry on processing watchdog requests despite the fact that the parent is never going to get an answer. We recover panics on the KDF routine and turn them into errors that put the tomb into a dying state. This begins the teardown of the child process and halts the processing of watchdog requests. For this reason, the parent may see 1 of 2 errors - either an error indicating that the child process exited with a non-zero exit code, or an error indicating a watchdog timeout. The test added for this handles both cases. This also introduces a small package - internal/testenv - with a single function (testenv.IsTestBinary) that can be used in production code and requires some special linker flags to enable. This permits us to export a function from the core secboot library to mock an internal function that should only be accessible in unit tests (and should panic otherwise).
I was going to also pass O_CLOEXEC, but the go runtime already does this.
This is ready for (hopefully the final 🙂) review again, although the unit tests currently fail because #363 needs to land |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you, one comment about the lock filename
nitpicks for possibly later:
- I think remote is strange terminology for this, I'm happy we fixed the code name, but the comments should probably use external/separate more instead of remote
- maybe the _support.go file should be split into invocation and external code, with common bits in the invocation .go file
internal/paths/paths.go
Outdated
// Argon2OutOfProcessHandlerSystemLockPath is the lock file path used to | ||
// serialize KDF requests system-wide. All process's that use the system-wide | ||
// lock participate in the lock/unlock contract described above. | ||
Argon2OutOfProcessHandlerSystemLockPath = filepath.Join(RunDir, "secboot_argon2.lock") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick but secboot-argon2.lock would seem more in line with other naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine with me - I've pushed an update to rename this.
As Argon2 is memory intensive, it's not suitable for multiple invocations in long-lived
garbage collected processes where multiple invocations may result in the process being
unable to perform new allocations, or even worse, triggering the kernel’s OOM killer.
For this reason, Argon2 is currently abstracted with an interface (
Argon2KDF
), of whichthe application sets a global version of this which is intended to proxy KDF requests to
a short-lived remote process which uses the real
InProcessArgon2KDF
.This adds some functionality to facilitate this.
Then there are JSON serializable types
Argon2OutOfProcessRequest
andArgon2OutOfProcessResponse
. The request can be fed directly toRunArgon2OutOfProcessRequest
on the remote side, but this is a fairly low-level API - the application still has to
deal with the transport.
There is a higher level API -
NewOutOfProcessArgon2KDF
, for use in the application process,and which returns an implementation of
Argon2KDF
which proxies requests to a short-livedremote handler process. The caller supplies a function to construct an appropriate
exec.Cmd
instance for this. This function configures the
exec.Cmd
so that the handler process receivesa request on stdin and it expects a response on stdout. The handler process is expected to pass
both
os.Stdin
andos.Stdout
toWaitForAndRunArgon2OutOfProcessRequest
, although it doesn'thardcode these descriptors for implementations that want to construct their own processes with
transports that don't rely on stdin and stdout.
Once a handler process has completed a request, it should exit cleanly. Neither
RunArgon2OutOfProcessRequest
orWaitForAndRunArgon2OutOfProcessRequest
support being calledmore than once in the same process.
The code in cmd/run_argon2 provides an example handler process, although this is mainly useful for
unit testing (where it is currently used). It is envisaged that the handler process will be a
special mode of snapd and snap-bootstrap in order to avoid adding an additional go binary just for
this.