SFENCE after streaming loops #1

workingjubilee · 2023-08-09T10:30:31Z

MOVNTI, MOVNTDQ, and friends weaken TSO when next to other stores. As most stores are not nontemporal, LLVM uses simple stores when lowering LLVMIR like atomic store ... release on x86, itself a lowering of Rust's AtomicBool::store(.., .., Ordering::Release). These facts could allow something like the following code to be emitted:

vmovntdq [addr],     ymmreg
vmovntdq [addr+N],   ymmreg
vmovntdq [addr+N*2], ymmreg
vmovntdq [addr+N*3], ymmreg
mov byte ptr [flag], 1 ; producer-consumer flag

But these stores are NOT ordered with respect to each other! Nontemporal stores induce the CPU to use write-combining buffers. These writes will be resolved in bursts instead of at once, and the write may be further deferred until a serialization point. Even a "yes-temporal" write to any other location will not force the deferred writes to be resolved first. Thus, assuming cache-line-sized buffers of 64 bytes, the CPU may resolve these writes in e.g. this actual order:

vmovntdq [addr+N*2], ymmreg
vmovntdq [addr+N*3], ymmreg
mov byte ptr [flag], 1
vmovntdq [addr+N],   ymmreg
vmovntdq [addr],     ymmreg

This could e.g. result in other threads accessing this address after the flag is set, thus accessing memory via safe code that was assumed to be correctly synchronized. This could result in observing tearing or other inconsistent program states, especially as the number of writes, thus the number of write buffers that may begin retiring simultaneously, thus the chance of them resolving in an unfortunate order, increases.

To guarantee program soundness, code using nontemporal stores must currently use SFENCE in its safety boundary, unless and until LLVM decides this combination of facts should be considered a miscompilation and motivation to choose lowerings that do not require explicit SFENCE. Even unsafe fn must explicitly pass this invariant to their callers, so it can be preferable for a function to internally "close over" any possible resulting unsoundness.

As one function using streaming stores is used in a fairly tight loop itself, DisplayFrameBuffer::copy_untrusted_row_rgb24_to_bgrx32, name it as a streaming store and pass the invariant up. Then inside copy_from_raw_untrusted_rgb24_to_bgrx32 use SFENCE. This prevents undue performance overhead from repeated SFENCE usage. However, this IS required as some callers immediately do use atomic stores in precisely the way that causes undefined behavior.

MOVNTI, MOVNTDQ, and friends weaken TSO when next to other stores. As most stores are not nontemporal, LLVM uses simple stores when lowering LLVMIR like `atomic store ... release` on x86, itself a lowering of Rust's `AtomicBool::store(.., .., Ordering::Release)`. These facts could allow something like the following code to be emitted: ```asm vmovntdq [addr], ymmreg vmovntdq [addr+32], ymmreg vmovntdq [addr+64], ymmreg vmovntdq [addr+96], ymmreg mov byte ptr [flag], 1 ; producer-consumer flag ``` But these stores are NOT ordered with respect to each other! Nontemporal stores induce the CPU to use write-combining buffers. These writes will be resolved in bursts instead of at once, and the write may be further deferred until a serialization point. Even a "yes-temporal" write to any other location will not force the deferred writes to be resolved first. Thus, assuming cache-line-sized buffers of 64 bytes, the CPU may resolve these writes in e.g. this actual order: ```asm vmovntdq [addr+64], ymmreg vmovntdq [addr+96], ymmreg mov byte ptr [flag], 1 vmovntdq [addr+32], ymmreg vmovntdq [addr], ymmreg ``` This could e.g. result in other threads accessing this address after the flag is set, thus accessing memory via safe code that was assumed to be correctly synchronized. This could result in observing tearing or other inconsistent program states, especially as the number of writes, thus the number of write buffers that may begin retiring simultaneously, thus the chance of them resolving in an unfortunate order, increases. To guarantee program soundness, code using nontemporal stores must currently use SFENCE in its safety boundary, unless and until LLVM decides this combination of facts should be considered a miscompilation and motivation to choose lowerings that do not require explicit SFENCE. Even `unsafe fn` must explicitly pass this invariant to their callers! As one function using streaming stores is used in a fairly tight loop itself, `DisplayFrameBuffer::copy_untrusted_row_rgb24_to_bgrx32`, rename it as a streaming store and pass up the invariant. Then inside `copy_from_raw_untrusted_rgb24_to_bgrx32` use SFENCE. This prevents undue performance overhead from repeated SFENCE usage. However, this IS required, as some callers immediately do use atomic stores in precisely the way that causes undefined behavior.

workingjubilee · 2023-08-09T10:46:37Z

This code is technically completely untested due to the fact that I don't have a toolchain that supports the targets you use. I would suggest considering using the bare-metal targets instead, for the same reason we removed the *-linuxkernel target upstream, but I'm most certainly not the boss of your repo, so!

workingjubilee mentioned this pull request Aug 9, 2023

Non-temporal stores (and _mm_stream operations in stdarch) break our memory model rust-lang/rust#114582

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SFENCE after streaming loops #1

SFENCE after streaming loops #1

workingjubilee commented Aug 9, 2023

workingjubilee commented Aug 9, 2023

SFENCE after streaming loops #1

Are you sure you want to change the base?

SFENCE after streaming loops #1

Conversation

workingjubilee commented Aug 9, 2023

workingjubilee commented Aug 9, 2023