Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

argon2: Add helpers for running the KDF remotely #328

Merged
merged 43 commits into from
Jan 23, 2025

Conversation

chrisccoulson
Copy link
Collaborator

@chrisccoulson chrisccoulson commented Sep 3, 2024

As Argon2 is memory intensive, it's not suitable for multiple invocations in long-lived
garbage collected processes where multiple invocations may result in the process being
unable to perform new allocations, or even worse, triggering the kernel’s OOM killer.
For this reason, Argon2 is currently abstracted with an interface (Argon2KDF), of which
the application sets a global version of this which is intended to proxy KDF requests to
a short-lived remote process which uses the real InProcessArgon2KDF.

This adds some functionality to facilitate this.

Then there are JSON serializable types Argon2OutOfProcessRequest and
Argon2OutOfProcessResponse. The request can be fed directly to RunArgon2OutOfProcessRequest
on the remote side, but this is a fairly low-level API - the application still has to
deal with the transport.

There is a higher level API - NewOutOfProcessArgon2KDF, for use in the application process,
and which returns an implementation of Argon2KDF which proxies requests to a short-lived
remote handler process. The caller supplies a function to construct an appropriate exec.Cmd
instance for this. This function configures the exec.Cmd so that the handler process receives
a request on stdin and it expects a response on stdout. The handler process is expected to pass
both os.Stdin and os.Stdout to WaitForAndRunArgon2OutOfProcessRequest, although it doesn't
hardcode these descriptors for implementations that want to construct their own processes with
transports that don't rely on stdin and stdout.

Once a handler process has completed a request, it should exit cleanly. Neither
RunArgon2OutOfProcessRequest or WaitForAndRunArgon2OutOfProcessRequest support being called
more than once in the same process.

The code in cmd/run_argon2 provides an example handler process, although this is mainly useful for
unit testing (where it is currently used). It is envisaged that the handler process will be a
special mode of snapd and snap-bootstrap in order to avoid adding an additional go binary just for
this.

As Argon2 is memory intensive, it's not suitable for multiple invocations in
long-lived garbage collected processes. For this reason, Argon2 is
abstracted with an interface (Argon2KDF), of which the application sets
a global version of this which is intended to proxy KDF requests to a
short-lived remote process which uses the real InProcessArgon2KDF.

This adds some functionality to facilitate this.

First of all, InProcessArgon2KDF is no longer a variable - it's a
function. By default, it's methods return an error unless the
application code has called SetIsArgon2RemoteProcess, which unlocks the
real in-process KDF.

Then there are JSON serializable types "Argon2RemoteInput" and
"Argon2RemoteOutput". The input can be fed directly to
RunArgon2RequestInRemoteProcess on the remote side, but this is a
fairly low-level API.

There is a higher level API - NewRemoteArgon2KDF, for use in the
application process, and which returns an implementation of Argon2KDF
which proxies requests to a short-lived remote helper process. The
caller supplied a function to construct an appropriate exec.Cmd
instance for this. This function is configured so that the remote
process recieves a request on stdin and it expects a response on
stdout. The remote process passes both os.Stdin and os.Stdout to
WaitAndRunArgon2RequestInRemoteProcess, although it doesn't hardcode
these descriptors for implementations that want to construct their own
transport that doesn't rely on stdin and stdout.

Once a remote process has completed a request, it should exit cleanly.
Neither RunArgon2RequestInRemoteProcess or
WaitAndRunArgon2RequestInRemoteProcess support being called more than
once in the same process.

The code in cmd/run_argon2 provides an example remote process, although
this is mainly useful for unit testing (where it is currently used). It
is envisaged that the remote process will be a special mode of snapd and
snap-bootstrap in order to avoid adding an additional new go binary just
for this.
@chrisccoulson chrisccoulson removed the request for review from pedronis September 4, 2024 14:24
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a first pass

"github.com/snapcore/secboot/internal/argon2"
)

type nullInProcessArgon2KDFImpl struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null is a bit of a strange name for something that actually fails

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type has gone, although there is still a nullArgon2KDFImpl that existed previously.

defer func() {
// Run Cmd.Wait in a defer so that we shut down on error paths too,
// and we capture the Wait error if there was no other error.
waitErr := cmd.Wait()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a timeout mechanism around this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored this quite a bit now. Not only is there a timeout mechanism when waiting for the remote process to exit (and go 1.20 introduces a nicer way to do this), there is also an optional watchdog mechanism that exercises the command processing / response sending code on the remote side, so that the parent knows that it's eventually going to get a response and can terminate the operation if this fails.

// os.Stdout when using - certainly when using [NewRemoteArgon2KDF] on the parent side.
// This function can only be called once in a process. Subsequent calls in the same
// process will result in an error response being returned.
func WaitAndRunArgon2RequestInRemoteProcess(in io.Reader, out io.Writer) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function code does not seem to be exercised

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets executed in a sub-process (cmd/run_argon2) by any test that makes use of NewOutOfProcessArgon2KDF, which might be why it doesn't appear in the test coverage.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an in-process test of this function so that it appears in the test coverage.

export_test.go Outdated
@@ -148,6 +150,11 @@ func MockHashAlgAvailable() (restore func()) {
}
}

func ClearIsArgon2RemoteProcess() {
atomic.StoreUint32(&argon2RemoteProcessStatus, notArgon2RemoteProcess)
runtime.GC()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the GC here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment to this function to explain why.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this function (and the corresponding SetIsArgon2RemoteProcess entirely now.

}
}

switch input.Command {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a bit more of the error cases below should be tested

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some more test cases.

@pedronis
Copy link
Collaborator

pedronis commented Sep 4, 2024

@chrisccoulson Also OutOfProcess might be a clearer term than Remote, if a bit longer

@chrisccoulson
Copy link
Collaborator Author

@chrisccoulson Also OutOfProcess might be a clearer term than Remote, if a bit longer

I've renamed things to OutOfProcess.

This includes some other small refactorigns, eg, the handler process
makes use of the global Argon2KDF implementation now in the same way
that the parent process does.
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the changes, maybe I missed it but it doesn't seem you replied to my question about what to do if the subprocess takes way too long/gets stuck for some reason

@chrisccoulson
Copy link
Collaborator Author

thanks for the changes, maybe I missed it but it doesn't seem you replied to my question about what to do if the subprocess takes way too long/gets stuck for some reason

Aha, I'll think of a way to address that. As the process is meant to block for some time, I wonder whether creating some sort of regular watchdog or ping would be more appropriate here. It shouldn't be that hard to add to the protocol.

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a pass, there is quite a few moving parts here, probably a good idea for @valentindavid @bboozzoo and @ZeyadYasser to take a look for awareness and possibly review as well

argon2.go Outdated
@@ -119,14 +130,15 @@ func (o *Argon2Options) kdfParams(keyLen uint32) (*kdfParams, error) {

mode := o.Mode
if mode == Argon2Default {
//
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot the comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did. I've added a comment now.

// file we have opened. Try again with a 10ms backoff time,
// unless the request timeout fires first.
select {
case <-time.NewTimer(10 * time.Millisecond).C:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this a bit too aggressive? how long does it take usually to compute the actual KDF?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, we have no idea how long it will be before we can acquire the lock, so it's a bit of a trade-off between being aggressive and waiting longer than is necessary, where the lock ends up unheld for longer for a period. I'm not sure what time to use for the backoff here tbh - the other option is to make the call to flock blocking, but then there's no way for it to be able to timeout.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to 100ms.

select {
case <-timeoutTimer.C:
// The timeout has expired.
return nil, errArgon2OutOfProcessHandlerSystemLockTimeout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this case is not reached by tests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit hit and miss whether the timeout on line 154 gets tested, or this one gets tested - it depends where the code lands before the timer expires. I'm not sure of a better way of doing it but I'll try and think of a way to make it a bit more reproducible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored this code to make it more reproducible where the code lands during tests.

Comment on lines 543 to 544
var startupWg sync.WaitGroup
startupWg.Add(1) // Mark the WaitGroup as waiting for a single event.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to use a channel for this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind, I just picked sync.WaitGroup because it seems like it's designed for this use case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to a channel now.

These tests keep passing locally and failing in github, and memory usage
is the only thing I can think of that might be causing the failures.
Also abort WaitForAndRunArgon2OutOfProcessRequest on an error response,
as the underlying RunArgon2OutOfProcessRequest function can't be called
again anyway.
argon2.go Outdated
benchmarkParams := &argon2.BenchmarkParams{
MaxMemoryCostKiB: 1 * 1024 * 1024, // the default maximum memory cost is 1GiB.
TargetDuration: 2 * time.Second, // the default target duration is 2s.
}

if o.MemoryKiB != 0 {
// The memory cost has been spcified expli
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The memory cost has been spcified expli
// The memory cost has been specified explicitly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed this.

Comment on lines 132 to 137
str := "cannot process request: " + string(e.ErrorType)
if e.ErrorString != "" {
str += " ("
str += e.ErrorString
str += ")"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use strings.Builder(), it's available from 1.10+

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using that now.

return actualRsp, nil
}

func (k *outOfProcessArgon2KDFImpl) Derive(passphrase string, salt []byte, mode Argon2Mode, params *Argon2CostParams, keyLen uint32) (key []byte, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's part of the existing interface, but if params is always expected to be non nil, this could pass value not a pointer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable, but given that this interface has been around for a while, shall we leave this one for another PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yet, we can do a followup, or leave a TODO

argon2_out_of_process_support.go Show resolved Hide resolved
argon2_out_of_process_support.go Show resolved Hide resolved
// eventually returning an error to the caller.
var req *Argon2OutOfProcessRequest
dec := json.NewDecoder(in)
dec.DisallowUnknownFields()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

means that the both ends need to be in sync now, which hopefully is true most of the time, but should there be some format, version field in requests/responses to indicate protocol version?

Just to be clear, I would not want to introduce any form to a handshake, as this would complicate the code even more, but a clear version check would help with debugging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the calls to DisallowUnknownFields everywhere it was used.

startupCh := make(chan struct{})

tmb.Go(func() error {
startupCh <- struct{}{} // Unblock the main routine.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should have the same effect

Suggested change
startupCh <- struct{}{} // Unblock the main routine.
close(startupCh)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this change now.

argon2_out_of_process_support.go Show resolved Hide resolved
This updates the x/crypto repository to the latest version we possibly
can whilst still maintaining compatibility with go 1.18.
@pedronis pedronis requested a review from bboozzoo December 10, 2024 08:53
Copy link
Contributor

@bboozzoo bboozzoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, I need to do another pass with a clear head

// Argon2OutOfProcessHandlerSystemLockPath is the lock file path used to
// serialize KDF requests system-wide. All process's that use the system-wide
// lock participate in the lock/unlock contract described above.
Argon2OutOfProcessHandlerSystemLockPath = filepath.Join(RunDir, "snapd/argon2.lock")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if secbot was consumed by other project than snapd, how could they set a different path given they would not be able to import secboot's internal package?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, the path already includes a directory which is not guaranteed to exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per discussion with @chrisccoulson we'll use a snapd agnostic path, or consider making the lock file location an argument

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this to /run/secboot_argon2.lock

Comment on lines +181 to +182
// Note that Argon2 requests are serialized using a system-wide lock, which this function does not
// explicitly release. If the lock is acquired, it returns a callback that the caller may choose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I were to run 2 out of process instances, requests to each instance would be serialized on the same shared lock. Why not a per instance lock?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per discussion with @chrisccoulson Conceptually follows the same approach as cryptsetup, trying to keep the lock as close to the process as possible.


costParams := &Argon2CostParams{
Time: request.Time,
MemoryKiB: request.MemoryKiB,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can memory be 0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per discussion with @chrisccoulson 0 is valid, the code will use a minimum of 32kB.

@chrisccoulson Can you add a comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment explaining why we don't test MemoryKiB here.

argon2_out_of_process_support.go Show resolved Hide resolved
// future, which might require replacing the existing library we use for Argon2.
func RunArgon2OutOfProcessRequest(request *Argon2OutOfProcessRequest) (response *Argon2OutOfProcessResponse, lockRelease func()) {
// Perform checks of arguments that are common to call requests
switch request.Mode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to move the request validation to separate helper, eg:

func validateOutOfProcessRequest(req *Argon2OutOfProcessRequest ) error {
    ..
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just cosmetics, not high priority

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might look at this in a follow-up

//
// The implementation is expected to be paired with an equivalent implementation of
// [Argon2OutOfProcessWatchdogMonitor] in the parent process.
type Argon2OutOfProcessWatchdogHandler = func(challenge []byte) (response []byte, err error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless this needs to be a type alias?

Suggested change
type Argon2OutOfProcessWatchdogHandler = func(challenge []byte) (response []byte, err error)
type Argon2OutOfProcessWatchdogHandler func(challenge []byte) (response []byte, err error)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are no longer type aliases

Comment on lines +405 to +412
// Note that Argon2 requests are serialized using a system-wide lock, which this function does not
// explicitly release. If the lock is acquired, it returns a callback that the caller may choose
// to execute in order to explicitly release the lock, or the caller can just leave it to be
// implicitly released on process exit. If the lock is explicitly released, the caller must be
// sure that the large amount of memory allocated for the Argon2 operation has been freed and
// returned back to the OS, else this defeats the point of having a system-wide lock (to avoid
// having multiple processes with high physical memory requirements running at the same time). If
// the lock wasn't acquired, no release callback will be returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the locking exercise was left to the caller in the parent process? it feels like this is imposing a significant limitation but without visibility by the users of secboot.

}

// Attempt to open the lock file for writing.
lockFile, err = os.OpenFile(paths.Argon2OutOfProcessHandlerSystemLockPath, os.O_RDWR|os.O_CREATE, 0600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| syscall.O_NOFOLLOW | syscall.O_CLOEXEC ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLOEXEC may be set by default, let's double check

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added O_NOFOLLOW now, and verified that O_CLOEXEC is set b the go runtime.

argon2_out_of_process_support_sync.go Show resolved Hide resolved
The gopkg.in/tomb.v2 package does not handle routines that panic very
well - the active routine count will not decrement and it won't put the
tomb into a dying state. In fact, it looks like it makes it impossible
for the tomb to ever reach a dead state.

If the Argon2 crypto package panics, the KDF routine will disappear but
the child process will carry on processing watchdog requests despite the
fact that the parent is never going to get an answer.

We recover panics on the KDF routine and turn them into errors that put
the tomb into a dying state. This begins the teardown of the child
process and halts the processing of watchdog requests. For this reason,
the parent may see 1 of 2 errors - either an error indicating that the
child process exited with a non-zero exit code, or an error indicating a
watchdog timeout. The test added for this handles both cases.

This also introduces a small package - internal/testenv - with a single
function (testenv.IsTestBinary) that can be used in production code and
requires some special linker flags to enable. This permits us to export
a function from the core secboot library to mock an internal function
that should only be accessible in unit tests (and should panic
otherwise).
I was going to also pass O_CLOEXEC, but the go runtime already does
this.
@chrisccoulson
Copy link
Collaborator Author

This is ready for (hopefully the final 🙂) review again, although the unit tests currently fail because #363 needs to land

Copy link
Contributor

@bboozzoo bboozzoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, one comment about the lock filename

nitpicks for possibly later:

  • I think remote is strange terminology for this, I'm happy we fixed the code name, but the comments should probably use external/separate more instead of remote
  • maybe the _support.go file should be split into invocation and external code, with common bits in the invocation .go file

// Argon2OutOfProcessHandlerSystemLockPath is the lock file path used to
// serialize KDF requests system-wide. All process's that use the system-wide
// lock participate in the lock/unlock contract described above.
Argon2OutOfProcessHandlerSystemLockPath = filepath.Join(RunDir, "secboot_argon2.lock")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick but secboot-argon2.lock would seem more in line with other naming

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me - I've pushed an update to rename this.

@chrisccoulson chrisccoulson merged commit 5fbde5b into canonical:master Jan 23, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants