Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SFENCE after streaming loops #1

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

workingjubilee
Copy link

MOVNTI, MOVNTDQ, and friends weaken TSO when next to other stores. As most stores are not nontemporal, LLVM uses simple stores when lowering LLVMIR like atomic store ... release on x86, itself a lowering of Rust's AtomicBool::store(.., .., Ordering::Release). These facts could allow something like the following code to be emitted:

vmovntdq [addr],     ymmreg
vmovntdq [addr+N],   ymmreg
vmovntdq [addr+N*2], ymmreg
vmovntdq [addr+N*3], ymmreg
mov byte ptr [flag], 1 ; producer-consumer flag

But these stores are NOT ordered with respect to each other! Nontemporal stores induce the CPU to use write-combining buffers. These writes will be resolved in bursts instead of at once, and the write may be further deferred until a serialization point. Even a "yes-temporal" write to any other location will not force the deferred writes to be resolved first. Thus, assuming cache-line-sized buffers of 64 bytes, the CPU may resolve these writes in e.g. this actual order:

vmovntdq [addr+N*2], ymmreg
vmovntdq [addr+N*3], ymmreg
mov byte ptr [flag], 1
vmovntdq [addr+N],   ymmreg
vmovntdq [addr],     ymmreg

This could e.g. result in other threads accessing this address after the flag is set, thus accessing memory via safe code that was assumed to be correctly synchronized. This could result in observing tearing or other inconsistent program states, especially as the number of writes, thus the number of write buffers that may begin retiring simultaneously, thus the chance of them resolving in an unfortunate order, increases.

To guarantee program soundness, code using nontemporal stores must currently use SFENCE in its safety boundary, unless and until LLVM decides this combination of facts should be considered a miscompilation and motivation to choose lowerings that do not require explicit SFENCE. Even unsafe fn must explicitly pass this invariant to their callers, so it can be preferable for a function to internally "close over" any possible resulting unsoundness.

As one function using streaming stores is used in a fairly tight loop itself, DisplayFrameBuffer::copy_untrusted_row_rgb24_to_bgrx32, name it as a streaming store and pass the invariant up. Then inside copy_from_raw_untrusted_rgb24_to_bgrx32 use SFENCE. This prevents undue performance overhead from repeated SFENCE usage. However, this IS required as some callers immediately do use atomic stores in precisely the way that causes undefined behavior.

MOVNTI, MOVNTDQ, and friends weaken TSO when next to other stores. As
most stores are not nontemporal, LLVM uses simple stores when lowering
LLVMIR like `atomic store ... release` on x86, itself a lowering of
Rust's `AtomicBool::store(.., .., Ordering::Release)`. These facts
could allow something like the following code to be emitted:

```asm
vmovntdq [addr],     ymmreg
vmovntdq [addr+32],  ymmreg
vmovntdq [addr+64],  ymmreg
vmovntdq [addr+96],  ymmreg
mov byte ptr [flag], 1 ; producer-consumer flag
```

But these stores are NOT ordered with respect to each other! Nontemporal
stores induce the CPU to use write-combining buffers. These writes will
be resolved in bursts instead of at once, and the write may be further
deferred until a serialization point. Even a "yes-temporal" write to any
other location will not force the deferred writes to be resolved first.
Thus, assuming cache-line-sized buffers of 64 bytes, the CPU may resolve
these writes in e.g. this actual order:

```asm
vmovntdq [addr+64],  ymmreg
vmovntdq [addr+96],  ymmreg
mov byte ptr [flag], 1
vmovntdq [addr+32],  ymmreg
vmovntdq [addr],     ymmreg
```

This could e.g. result in other threads accessing this address after the
flag is set, thus accessing memory via safe code that was assumed to be
correctly synchronized. This could result in observing tearing or other
inconsistent program states, especially as the number of writes, thus
the number of write buffers that may begin retiring simultaneously,
thus the chance of them resolving in an unfortunate order, increases.

To guarantee program soundness, code using nontemporal stores must
currently use SFENCE in its safety boundary, unless and until LLVM
decides this combination of facts should be considered a miscompilation
and motivation to choose lowerings that do not require explicit SFENCE.
Even `unsafe fn` must explicitly pass this invariant to their callers!

As one function using streaming stores is used in a fairly tight loop
itself, `DisplayFrameBuffer::copy_untrusted_row_rgb24_to_bgrx32`,
rename it as a streaming store and pass up the invariant. Then inside
`copy_from_raw_untrusted_rgb24_to_bgrx32` use SFENCE. This prevents
undue performance overhead from repeated SFENCE usage. However, this
IS required, as some callers immediately do use atomic stores in
precisely the way that causes undefined behavior.
@workingjubilee
Copy link
Author

This code is technically completely untested due to the fact that I don't have a toolchain that supports the targets you use. I would suggest considering using the bare-metal targets instead, for the same reason we removed the *-linuxkernel target upstream, but I'm most certainly not the boss of your repo, so!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant