PVF: consider spawning a new process per job #584
Comments
I'm going to first try sending the artifact bytes over the socket, avoiding the need for any sandbox allowances at all. I was under the impression that it would take too long, but it's worth measuring.
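For anyone who wants to reproduce this kind of measurement outside of zombienet, here is a minimal sketch of timing an artifact-sized buffer going over a Unix socket pair. It is not the actual PVF host/worker code: the length-prefixed framing and the names `send_framed`, `recv_framed` and `time_socket_transfer` are assumptions made for illustration, and both ends live in one process, so real cross-process numbers will differ.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;
use std::time::{Duration, Instant};

/// Writes `bytes` to the stream, prefixed with the length as a little-endian u64.
/// (Hypothetical framing, chosen only for this sketch.)
fn send_framed(stream: &mut UnixStream, bytes: &[u8]) -> io::Result<()> {
    stream.write_all(&(bytes.len() as u64).to_le_bytes())?;
    stream.write_all(bytes)
}

/// Reads one length-prefixed frame written by `send_framed`.
fn recv_framed(stream: &mut UnixStream) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 8];
    stream.read_exact(&mut len_buf)?;
    let len = u64::from_le_bytes(len_buf) as usize;
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf)?;
    Ok(buf)
}

/// Sends `artifact` across a socket pair and returns the received bytes
/// together with the elapsed wall-clock time.
/// The sender runs on its own thread so a full socket buffer cannot deadlock
/// the transfer; the thread spawn is included in the measured time.
fn time_socket_transfer(artifact: Vec<u8>) -> io::Result<(Vec<u8>, Duration)> {
    let (mut host, mut worker) = UnixStream::pair()?;
    let start = Instant::now();
    let sender = thread::spawn(move || send_framed(&mut host, &artifact));
    let received = recv_framed(&mut worker)?;
    let elapsed = start.elapsed();
    sender.join().expect("sender thread panicked")?;
    Ok((received, elapsed))
}
```

Calling `time_socket_transfer(vec![0u8; 20 * 1024 * 1024])` a few times and averaging the returned duration gives a rough per-artifact figure for a 20MB buffer on a given machine.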
In principle SPREEs could be used by many parachains, so then we'd benefit from the final artifact, presumably an ELF .so file, being dlopened. Not a concern right now.
Numbers on PVF overhead (per job, per byte of data in PoV) would be really helpful to track, as they'd inform higher-level protocol decisions like bundling. That is, the lower we can get these overheads, the smaller we can make candidates, and probably the higher utilization we can get out of approval cores.
Just measured on local zombienet. The result is what I expected, but worth a try I guess.
Next I'll benchmark a new process per job.
Am I reading this right? Is this meant to be microseconds or milliseconds?
Shared memory is supposed to be much faster, but it would require a bigger rewrite to test it, and I fear it wouldn't fit well with Rust's ownership scheme and require Maybe worth doing a quick PoC or extending the benchmarks we have, but first I'll try one-process-per-job as it's much simpler to test.
Locally on an M1 Mac, spawning takes 4-12ms. That's already how long execution by itself takes. When I tried on a beefy Linux machine it was worse, at 20ms. In the grand scheme of things, maybe it's acceptable? It still comfortably fits within the backing timeout. cc @eskimor @s0me0ne-unkn0wn
I tried the socket measurements again, this time on Linux, since my numbers on Mac were even worse than Stack Overflow's socket / shared memory numbers. On Linux I got 645us (
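To get comparable per-job spawn figures on another machine, a rough sketch like the following can be used. It only measures spawning a trivial program and waiting for it to exit; the function name `mean_spawn_time` is hypothetical, and a real worker binary would add its own startup cost on top of this.

```rust
use std::io;
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};

/// Spawns `program` `iterations` times, waiting for each child to exit,
/// and returns the mean wall-clock time per spawn-and-wait cycle.
/// Returns zero if `iterations` is zero.
fn mean_spawn_time(program: &str, iterations: u32) -> io::Result<Duration> {
    if iterations == 0 {
        return Ok(Duration::ZERO);
    }
    let mut total = Duration::ZERO;
    for _ in 0..iterations {
        let start = Instant::now();
        // Detach stdio so the child does not inherit or block on our handles.
        let mut child = Command::new(program)
            .stdin(Stdio::null())
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .spawn()?;
        child.wait()?;
        total += start.elapsed();
    }
    Ok(total / iterations)
}
```

For example, `mean_spawn_time("true", 100)` gives a lower bound for process creation on the host; pointing it at an actual worker binary (path not shown here) would be the more relevant number.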
Have you tried spawning from memory yet? It could somewhat reduce the overhead.
The question is, how bad is that compared to reading the artifact from the fs, in the case where we read them only once (or never read them at all, preparing and keeping everything in memory)?
We probably can't keep all artifacts in memory while we scale to tens of thousands of registered parachains. We can give heads-up notifications to pre-load specific artifacts when we are assigned to a parachain in approval-checking, but it'd probably only give a few hundred ms of lead time (probably enough). These are the numbers we're interested in:
Together it's our total overhead per validation job. Ideally we can get this as lean as possible on Linux/1st-class platforms for validators. (relates to #607)
20MB smallish ...? @s0me0ne-unkn0wn is this something we can easily bring down in the future?
Hmm, shouldn't it be 6ms?
Checked with the Moonbeam runtime (7.7 MB uncompressed Wasm bytecode). Wasmtime produced a 28 MB artifact. It's an unstripped ELF; if I strip it, it doesn't make much difference: ~300 KB are gone, but that's it. My compiler produces a 22.7 MB artifact from the same bytecode, so the value is fair enough. Just for comparison, the adder runtime, 126 KB of Wasm bytecode, produces a 149 KB unstripped (143 KB stripped) artifact with Wasmtime. My compiler produces a 102 KB artifact.
As I mentioned before, we can compress them in-memory. The 28 MB Moonbeam artifact is compressed with zstd (which we use in Substrate to compress Wasm blobs) down to 12.3 MB.
Let's see, I know a very good article where someone reduced an ELF binary quite significantly*). Very entertaining. 😎 *) Removed the spoiler.
Thinking back to the original issue, which is that we want to prevent attackers from discovering other artifacts on disk. I found out that landlock has this access right for directories:
I believe we can apply this and then prepend artifact filenames with a random string, so the only way we should be able to access an artifact is when the host sends us a job for it. Edit: I wrote a test and the access right works as expected.
Isn't this also a performance hit, as we would have to decompress them each time? Still, could be worth evaluating.
Interesting. This prevents access to other artifacts, but the artifact filename itself is now a source of randomness. We can rename the artifact to remove the random string before passing it for a job, and revert the name after. But then we have to account for races, as multiple workers may need the artifact concomitantly. Perhaps simpler is for each worker to have an expected file that it always reads, i.e.
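One way the "expected file per worker" idea could look is sketched below, assuming the host hard-links the requested artifact to a fixed path inside a per-worker directory before each job. The fixed filename `artifact`, the directory layout, and the function names are all hypothetical and not taken from the actual implementation.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical fixed path that a given worker always reads its artifact from.
fn worker_artifact_path(worker_dir: &Path) -> PathBuf {
    worker_dir.join("artifact")
}

/// Host side: points the worker's fixed path at `artifact` via a hard link,
/// replacing whatever the previous job left there.
/// Note: remove-then-link is not atomic, and hard links require both paths
/// to be on the same filesystem.
fn link_artifact_for_job(artifact: &Path, worker_dir: &Path) -> io::Result<PathBuf> {
    let link = worker_artifact_path(worker_dir);
    match fs::remove_file(&link) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    fs::hard_link(artifact, &link)?;
    Ok(link)
}
```

With this layout the worker never needs to learn the randomized filename in the cache, and since each worker has its own directory, two workers using the same artifact do not race on a rename.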
I almost finished implementing the
One nice thing about the existing file-based IPC is that the child writes a temporary prepared artifact, and the host atomically renames it to the final destination if the child succeeded. But I don't see why we can't do the same on the host side with shared memory. Also, I think the linking solution is going to be the best performance-wise, but shared memory should only incur a couple of ms extra, max. I can do a quick PoC to double-check it.
One other reason maybe not to do it is that the networking sandboxing would just be temporary until the full seccomp solution arrives. However, it seems that without a socket in play, there's just less of a chance of things going wrong here. I.e., if there's no socket at all, we could entirely block the socket read/write syscalls. It would be a big change, so I'm looking for feedback before I do it.
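The write-to-temp-then-rename pattern mentioned above can be sketched as follows. This is a simplified illustration rather than the host's real code; `finalize_artifact` and its `child_succeeded` flag are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Host side: publishes the temporary artifact written by the child.
/// On success the temp file is renamed to `dest` (atomic when both paths
/// are on the same filesystem); on failure the temp file is discarded.
/// Returns `true` if the artifact was published.
fn finalize_artifact(tmp: &Path, dest: &Path, child_succeeded: bool) -> io::Result<bool> {
    if child_succeeded {
        fs::rename(tmp, dest)?;
        return Ok(true);
    }
    // The child failed: remove whatever it left behind, if anything.
    match fs::remove_file(tmp) {
        Ok(()) => Ok(false),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

The useful property is that readers of `dest` only ever see either the old complete artifact or the new complete one, never a partially written file.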
Looks like shared memory was considered in the past: Haven't dug into it yet. But I don't like the solution I have with linking, because the child and host are very tightly coupled around the expected FS structure. Of course they were already tightly coupled, but now the coupling is hidden and easy to break, and must be documented everywhere.
Currently we spawn worker processes for PVF jobs (prepare/execute) and each incoming job gets its own thread within that process.
This works fine but there are potential security issues with this as described in paritytech/polkadot#7580 (comment). Namely, we can't fully sandbox the process because we have to have an allow-exception for the entire PVF artifact cache directory.
We should investigate the overhead of spawning a whole separate process for each job as opposed to a thread. It may be less overhead now that we are spawning processes from smaller worker binaries instead of polkadot. If the cost is low enough, we can switch to one-process-per-job, which allows us to sandbox the process better. This would probably also come with changes to the execution queue logic (most likely simplifications).