\documentclass[10pt]{article}
% standard packages
% A more pleasant font
\usepackage[T1]{fontenc} % use postscript type 1 fonts
\usepackage{textcomp} % use symbols in TS1 encoding
\usepackage{mathptmx,helvet,courier} % use nice, standard fonts for roman, sans and monospace respectively
% Improves the text layout
\usepackage{microtype}
\usepackage{lscape}
\usepackage{fancyhdr}
\usepackage{epsfig}
\usepackage{subfigure}
\usepackage{url}
\usepackage{graphics}
\usepackage{enumerate}
\usepackage{ifthen}
\usepackage{float}
\usepackage{listings}
\lstset{basicstyle=\ttfamily}
% \usepackage[strings]{underscore}
% \usepackage{underscore}
\usepackage[bookmarks=true,bookmarksnumbered=true,pdfborder={0 0 0}]{hyperref}
\lstset{
literate={\_}{}{0\discretionary{\_}{}{\_}}%
}
\usepackage[table]{xcolor}
\usepackage{booktabs}
\DeclareUrlCommand\email{}
\pagestyle{fancy}
\rhead{}
\newfloat{listing}{tbp}{lol}
\floatname{listing}{Listing}
\begin{document}
\title{P2055R1: A Relaxed Guide to \co{memory_order_relaxed}}
\newcommand{\co}[1]{\lstinline[breaklines=yes,breakatwhitespace=yes]{#1}}
\author{
Hans Boehm\\\email{[email protected]} \and
David Goldblatt\\\email{[email protected]} \and
Paul E.~McKenney\\\email{[email protected]} \and
The Indefatigable TBD
}
\date{April 8, 2024}
\maketitle{}
Audience: SG1 (Informational)
\begin{abstract}
The out-of-thin-air (OOTA) and read-from-untaken-branch (RFUB)
properties of the specification of \co{memory_order_relaxed}
have resulted in considerable consternation over the years.
Although there are no known instances of full-blown OOTA
behavior, and no known RFUB-induced failures of production code,
the theoretical possibility of these properties severely
complicates automated analysis of large C and C++ code bases.
Thus far, attempts to eliminate OOTA and RFUB properties from
the memory model have resulted in otherwise needless added
overheads on weakly ordered systems on the one hand or
excessive implementation complexity on the other.
However, \co{memory_order_relaxed} never was intended to be used
in arbitrary code, but rather as a part of deliberate application
of specific concurrency designs.
This paper forms an initial catalog of patterns underlying such
designs, with an eye towards identifying bugs in existing code
and preventing bugs in new code.
\end{abstract}
\section{Background}
\label{sec:Background}
\subsection{Memory Ordering Strength and Expense}
\label{sec:Memory Ordering Strength and Expense}
C++ atomic operations are sequentially consistent by default.
In programs that use only sequentially consistent atomic operations,
these atomics enjoy mathematically appealing interleaving semantics.
However, no computer system in common use in the year 2023 provides
sequential consistency at the hardware level, not even relatively
strongly ordered systems such as x86.
This means that sequentially consistent atomics can be quite expensive,
especially on weakly ordered systems such as ARM and IBM Power.
Furthermore, a great many use cases in real-world software do not
require sequentially consistent atomics; in fact, the much lower-overhead
release/acquire ordering suffices in a great many situations.
For this reason, C++ provides \co{memory_order} \co{enum} values to
allow the user to control the strength of individual atomic operations.
The \co{memory_order_release} and \co{memory_order_acquire} values
provide the aforementioned release-acquire ordering.
But most relevant to this document, \co{memory_order_relaxed} allows
arbitrary reordering of accesses to distinct locations.
With the exception of the unlamented Itanium CPU family, aligned and
machine-word-sized \co{memory_order_relaxed} loads and stores compile
to unadorned load and store instructions, providing complete control,
excellent efficiency, and stunning scalability.
And in practice, \co{memory_order_relaxed} accesses provide well-understood
ordering properties, which has led to them being heavily used.
Unfortunately, in theory \co{memory_order_relaxed} accesses are subject
to OOTA.
RFUB can occur very rarely in practice, and there is some debate as to
whether this is acceptable.
These issues are discussed in the following section.
\subsection{OOTA and RFUB}
\label{sec:OOTA and RFUB}
There has been considerable work done on OOTA and RFUB over many years,
building on prior work in the Java community, perhaps best exemplified
by the infamous ``Causality Test Cases''.\footnote{
\url{http://www.cs.umd.edu/~pugh/java/memoryModel/unifiedProposal/testcases.html}.}
There has long been hope that additional research effort will identify
a model of OOTA that all can live with, for example, on the part of Paul,
and that everyone would come to appreciate the relative simplicity of
strengthening \co{memory_order_relaxed} to forbid prior reads to be
reordered with later writes, for example, on the part of
Hans~\cite{Boehm:2014:OGA:2618128.2618134,HansBoehm2019OOTArevisitedAgain,Lahav:2017:RSC:3062341.3062352}.
And progress has been made on both fronts.
On the additional-research front, we now have methods of distinguishing
between OOTA on the one hand and simple reordering on the other.
Unfortunately, one method requires per-scenario
creativity~\cite{PaulEMcKenney2016OOTA},
while others have not as yet been looked upon with favor by compiler
implementers~\cite{Lahav:2017:RSC:3062341.3062352,Sinclair:2017:CAR:3079856.3080206,Lee:10.1145/3385412.3386010,MarkBatty2019ModularRelaxedDependenciesOOTA}.
Backwards-propagating undefined behavior can be especially
troublesome~\cite{DavidGoldblatt2019NoElegantOOTAfix},
but there are proposals that this be
restricted~\cite{HansBoehm2020UBalternatives}.
Perhaps such restrictions might eliminate some of the more troublesome
examples of OOTA.
On the reads-before-writes front, there are some indications that newer
weakly ordered hardware incurs reduced penalties for ordering relaxed
reads before relaxed writes, at least for low-end and high-end systems.
However, middle-end systems still incur significant penalties.
This has led to renewed suggestions that a new \co{memory_order_load_store}
member be added to the \co{memory_order} \co{enum}.
It has also led to a renewed proposal that \co{memory_order_relaxed}
be strengthened so as to prohibit reordering of prior loads and subsequent
stores~\cite{LukeGeeson2023TightenRelaxed};
however, this proposal was not universally
loved~\cite{RichardGrisenthwaite2023TightenRelaxedJustSayNo}.
A further complication is that although there is general agreement that
OOTA behaviors must be forbidden, there is some debate on the need to
forbid RFUB behaviors.
Some of those who believe that RFUB behaviors should be allowed
(for example, Paul) argue that the scenarios where RFUB can occur
are contrived and with no known production use cases.
There is general agreement that simple reordering should be allowed, as
it occurs in practice in scenarios that are both useful and commonplace.
\subsection{OOTA Examples}
\label{sec:OOTA Examples}
This section looks at examples of simple reordering, OOTA, and RFUB.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> x(0);
2 atomic<int> y(0);
3
4 void thread1()
5 {
6 unsigned long r1 = x.load(memory_order_relaxed);
7 y.store(r1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 unsigned long r2 = y.load(memory_order_relaxed);
13 x.store(42, memory_order_relaxed);
14 }
\end{verbatim}
\caption{Simple Reordering}
\label{lst:Simple Reordering}
\end{listing}
Listing~\ref{lst:Simple Reordering}
shows an example of simple reordering.
Both the compiler and the CPU are within their rights to reorder
lines~12 and~13, in which case the following sequence of events
will result in all of \co{x}, \co{y}, \co{r1}, and \co{r2} having
the value 42:
\begin{enumerate}
\item Line~13 stores 42 to \co{x}.
\item Line~6 loads 42 from \co{x} into \co{r1}.
\item Line~7 stores \co{r1}, and thus 42, to \co{y}.
\item Line~12 loads 42 from \co{y} to \co{r2}.
\end{enumerate}
It is strongly and generally agreed that this simple reordering should
be allowed.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> x(0);
2 atomic<int> y(0);
3
4 void thread1()
5 {
6 unsigned long r1 = x.load(memory_order_relaxed);
7 y.store(r1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 unsigned long r2 = y.load(memory_order_relaxed);
13 x.store(r2, memory_order_relaxed);
14 }
\end{verbatim}
\caption{Simple OOTA}
\label{lst:Simple OOTA}
\end{listing}
Listing~\ref{lst:Simple OOTA} shows an OOTA example in which all of
\co{x}, \co{y}, \co{r1}, and \co{r2} have final values of 42, an
outcome that should be prohibited.
In actual hardware, this prohibition is enforced by TSO ordering in
strongly ordered systems and by data dependency ordering in weakly
ordered systems.
Compilers cannot store something until after they load it, which
results in instructions being emitted such that the hardware enforcement
applies.
The need to prohibit simple OOTA is one reason why compiler-based
value speculation optimizations are frowned upon.
% @@@ Hans, I omitted your array-based setup, as this involves an
% out-of-bounds array reference, which is itself UB.
% If this example is needed, please help me understand the point that
% it is making.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> x(0);
2 atomic<int> y(0);
3
4 void thread1()
5 {
6 unsigned long r1 = x.load(memory_order_relaxed);
7 y.store(r1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 bool assigned_42 = false;
13 unsigned long r2 = y.load(memory_order_relaxed);
14 if (r2 != 42) {
15 assigned_42 = true;
16 r2 = 42;
17 }
18 x.store(r2, memory_order_relaxed);
19 }
\end{verbatim}
\caption{Simple RFUB}
\label{lst:Simple RFUB}
\end{listing}
Listing~\ref{lst:Simple RFUB}
shows a simple example of RFUB, which again means ``read from untaken
branch''.
The following sequence of events will result in all of \co{x}, \co{y},
\co{r1}, and \co{r2} having the value 42:
\begin{enumerate}
\item The compiler notices that at line~17, the value of \co{r2}
is always 42.
\item The compiler thus substitutes the value 42 for \co{r2} in
line~18.
\item Both the compiler and the CPU are within their rights to reorder
lines~13 and~18.
\item Line~18 stores 42 to \co{x}.
\item Line~6 loads 42 from \co{x} into \co{r1}.
\item Line~7 stores \co{r1}, and thus 42, to \co{y}.
\item Line~13 loads 42 from \co{y} to \co{r2}.
\item Because \co{r2} is equal to 42, lines~15 and~16 are never
executed, even though it was exactly these lines of code
that justified the above compiler optimizations.
\end{enumerate}
Many concurrency experts believe that RFUB should be allowed.
\subsection{Classification of Patterns}
\label{sec:Classification of Patterns}
Perhaps agreement on all of these points will be reached, but in the
meantime, \co{memory_order_relaxed} use is increasing, and with it an
increasing need to identify known-safe usage patterns.
In the best case, these usage patterns might be automatically checked
in existing code, but at a minimum we hope that this list will be
useful to code reviewers.
Either way, the goals are to identify bugs in existing code and to
help avoid bugs in new code.
This paper is a first step toward such a set of patterns.
The term \emph{full C++} refers to the C++20 memory model as stated in
the current draft.
The term \emph{strict C++} refers to the subset of full C++ obtained
by dropping the following normative encouragement from the C++20 memory
model:
``Implementations should ensure that no `out-of-thin-air' values are
computed that circularly depend on their own computation.''
Some (but not all) of the proto-patterns in this document are safe in
strict C++, but all of them are safe in full C++.
\section{Relaxed Design Patterns}
\label{sec:Relaxed Design Patterns}
Many of these patterns are taken from Hans's \co{memory-model-design}
posting on September 4, 2018.\footnote{
Message-ID: \co{<CAMOCf+jchGw6DeE2NyCJA3wfFbNH-WFn59JruZPSWt9_jPW9NQ@mail.gmail.com>}.}
In the examples, \co{x} and \co{y} denote potentially shared locations,
while \co{r1} and \co{r2} denote local variables (``registers'') whose
addresses are not taken.
\subsection{Non-Racing Accesses}
\label{sec:Non-Racing Accesses}
Any non-racing access to an atomic object can be a relaxed access.
Because the access is not concurrent with a conflicting access (store
against either store or load), further ordering is unnecessary.\footnote{
This covers case~\#8 in Hans's September 4, 2018 email.}
In fact, such accesses can in theory be non-atomic.
In environments where atomicity is controlled by the access rather
than the object definition, such accesses are often non-atomic in
practice~\cite{JadeAlglave2019WhoAfraidCompiler}.
\begin{listing}[tbp]
\begin{verbatim}
1 int x = 0;
2 int y = 0;
3
4 void thread1()
5 {
6 if (x)
7 y = 1;
8 }
9
10 void thread2()
11 {
12 if (y)
13 x = 1;
14 }
\end{verbatim}
\caption{Non-Atomic Accesses Sometimes Respect Control Dependencies}
\label{lst:Non-Atomic Accesses Sometimes Respect Control Dependencies}
\end{listing}
For example, given concurrent execution of \co{thread1()} and
\co{thread2()} in
Listing~\ref{lst:Non-Atomic Accesses Sometimes Respect Control Dependencies},
the only permitted outcome results in both \co{x} and \co{y} being
equal to zero in both full C++ and strict C++.
Any other outcome would violate the ``sequential consistency for
data race free programs'' principle, and must effectively be due to a
compiler-created data race, which is forbidden.
\begin{listing}[tbp]
\begin{verbatim}
1 std::atomic<int> x = 0;
2 std::atomic<int> y = 0;
3
4 void thread1()
5 {
6 if (x.load(memory_order_relaxed))
7 y.store(1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 if (y.load(memory_order_relaxed))
13 x.store(1, memory_order_relaxed);
14 }
\end{verbatim}
\caption{Strict C++ Does Not Require Atomics to Respect Control Dependencies}
\label{lst:Strict C++ Does Not Require Atomics to Respect Control Dependencies}
\end{listing}
In contrast, in the analogous program using C++ atomics
(see Listing~\ref{lst:Strict C++ Does Not Require Atomics to Respect Control Dependencies}),
additional behaviors are permitted by strict C++,
including the one resulting in the final values of both \co{x}
and \co{y} being \co{1}.
The restriction to ``strict C++'' is important because this code fragment
is considered to be an example of the OOTA behavior that is forbidden
by the normative encouragement in that same standard.
In short, although any non-racing access to an atomic object may be
relaxed, strict C++ counter-intuitively classifies many access patterns
as racy.
\begin{listing}[tbp]
\begin{verbatim}
1 if (!x_init.load(memory_order_acquire)) {
2 lock_guard<mutex> _(x_init_mtx);
3 if (!x_init.load(memory_order_relaxed)) {
4 initialize(&x);
5 x_init.store(true, memory_order_release);
6 }
7 }
\end{verbatim}
\caption{Double-Checked Locking}
\label{lst:Double-Checked Locking}
\end{listing}
Relaxed atomics can nevertheless be useful for non-racing accesses
in real-life situations, as can be seen in the double-checked
locking example shown in
Listing~\ref{lst:Double-Checked Locking}.
Line~1 uses an acquire load from \co{x_init} to check whether
initialization is needed, in which case line~2 acquires \co{x_init_mtx}
using an anonymous \co{lock_guard} (anonymity being designated by the
underscore where a local-variable name would be expected).
Once this lock is held, line~3 uses a relaxed load to recheck \co{x_init}
to see whether initialization is still needed.
A relaxed load works here because holding the lock prevents other
threads from storing to \co{x_init}.
Line~4 carries out the initialization, and line~5 updates \co{x_init}
to indicate that initialization is complete.
When the acquire load on line~1 returns the value \co{true}, that
load synchronizes with the release store on line~5, guaranteeing
that any code following line~7 sees the results of that initialization.
These patterns can be at least partially checked using data-race
detectors, both static and runtime.
\subsubsection{Initialization and Cleanup}
\label{sec:Initialization and Cleanup}
Important special cases of this pattern are the single-threaded
initialization and cleanup phases of an otherwise concurrent program.
These use cases are one motivation for the strong ordering guarantees
of thread creation and destruction.
These guarantees permit the single-threaded initialization and cleanup
code to run race free, with no need to consider interference from the
intervening code that runs multithreaded.
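As a minimal sketch of this special case, the following listing confines
plain (non-atomic) accesses to the initialization and cleanup phases,
with thread creation and join providing all needed ordering (the
\co{compute()} function is a hypothetical stand-in for the
multithreaded phase):
\begin{listing}[tbp]
\begin{verbatim}
 1 int cfg;  // Plain non-atomic object.
 2
 3 void worker()
 4 {
 5   compute(cfg);  // Thread creation ordered the
 6 }                // initialization of cfg before this read.
 7
 8 int main()
 9 {
10   cfg = 42;               // Single-threaded initialization.
11   std::thread t(worker);  // Creation synchronizes with worker().
12   t.join();               // worker() synchronizes with join().
13   cfg = 0;                // Single-threaded cleanup, race free.
14 }
\end{verbatim}
\caption{Single-Threaded Initialization and Cleanup (Sketch)}
\label{lst:Single-Threaded Initialization and Cleanup Sketch}
\end{listing}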
\subsubsection{Lock-Based Critical Sections}
\label{sec:Lock-Based Critical Sections}
Exclusive locks provide mutual exclusion, so that objects accessed
only while holding a given lock may be accessed using
\co{memory_order_relaxed} accesses, or, for that matter, using
non-atomic accesses.
Reader-writer locks provide a weaker form of mutual exclusion.
However, objects that are updated only while a given reader-writer
lock is write-held and read only when that same lock is either
read-held or write-held may also be accessed using
\co{memory_order_relaxed} accesses, or, again, using non-atomic accesses.
Of course, non-atomic accesses are almost always used with pure locking.
However, \co{memory_order_relaxed} accesses are sometimes quite useful,
for example, in cases where objects pass through a software pipeline,
where one stage uses pure locking and another stage relies on atomic
operations.
% @@@ Hans: Similar patterns exist for RCU and for hazard pointers.
% I am definitely not adding these this close to the deadline,
% but perhaps the next version.
\subsection{Single-Location Data Structures}
\label{sec:Single-Location Data Structures}
Relaxed atomic operations provide sequentially consistent access to
a single object.
This means that data structures that fit into a single object can
be accessed with relaxed atomics with no possibility of OOTA or
RFUB behavior.
Note well that a group of single-location data structures might well
interact in a way that could raise the spectre of OOTA or RFUB.
As before, design review should therefore pay careful attention to
information flow.
These patterns can be checked by verifying that no store to another shared
variable is affected by the value of the single-location data structure,
unless that value can be shown not to affect that same single-location
data structure, for example, if that other shared variable is part of
a unidirectional data flow
(see Section~\ref{sec:Unidirectional Data Flow}).
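As a minimal sketch of this pattern (assuming that the packed state has
no consistency constraints with other shared objects), the following
listing packs a stop bit and a count into a single atomic word and
manipulates it entirely with relaxed operations:
\begin{listing}[tbp]
\begin{verbatim}
 1 atomic<uint64_t> state(0);  // Bit 0: stop bit; bits 1-63: count.
 2
 3 bool try_increment()
 4 {
 5   uint64_t s = state.load(memory_order_relaxed);
 6   do {
 7     if (s & 1)
 8       return false;  // Stopped: no more increments.
 9   } while (!state.compare_exchange_weak(s, s + 2,
10                                         memory_order_relaxed));
11   return true;
12 }
\end{verbatim}
\caption{Single-Location Data Structure (Sketch)}
\label{lst:Single-Location Data Structure Sketch}
\end{listing}
Because every access is to the single location \co{state},
per-location sequential consistency applies, and this object in
isolation cannot participate in an OOTA or RFUB cycle.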
\subsection{Shared Fences}
\label{sec:Shared Fences}
The \co{atomic_thread_fence()} function can be used to order multiple
accesses.
For example, consider a series of acquire loads that are intended to provide
order against subsequent accesses, but not against each other.
Because the compiler will normally not be able to determine that the
acquire loads need not be ordered against each other, on some platforms
this will result in a memory-fence instruction being emitted after each
and every acquire load, when only the last fence is required.
These unnecessary fences can be avoided by replacing the acquire
loads with relaxed loads followed by a single
\co{atomic_thread_fence(memory_order_acquire)}~\cite[Section 4.1]{RaulSilvera2007WeakMemoryModel}.
This will have the desired effect of ordering all of these loads with any
subsequent accesses, while also avoiding the overhead of the redundant
fence instructions that would be expected from the acquire loads.
Similarly, consider a series of release stores that are intended to provide
order against prior accesses, but not against each other.
Again, the compiler might emit a memory-fence instruction before
each of the stores, when only the first fence is required.
These unnecessary fences can be avoided by replacing the release
stores with a single \co{atomic_thread_fence(memory_order_release)}
followed by a series of relaxed
stores~\cite[Section 4.2]{RaulSilvera2007WeakMemoryModel}.
This will have the desired effect of ordering all of these stores with any
prior accesses, while also avoiding the overhead of any redundant
fence instructions emitted for the release stores.
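As a minimal sketch of both transformations (\co{x1}, \co{x2}, \co{y1},
and \co{y2} are assumed to be atomic objects):
\begin{listing}[tbp]
\begin{verbatim}
 1 // Acquire side: relaxed loads, then one acquire fence.
 2 unsigned long r1 = x1.load(memory_order_relaxed);
 3 unsigned long r2 = x2.load(memory_order_relaxed);
 4 atomic_thread_fence(memory_order_acquire);
 5 // Subsequent accesses are ordered after both loads.
 6
 7 // Release side: one release fence, then relaxed stores.
 8 // Prior accesses are ordered before both stores.
 9 atomic_thread_fence(memory_order_release);
10 y1.store(1, memory_order_relaxed);
11 y2.store(2, memory_order_relaxed);
\end{verbatim}
\caption{Shared Acquire and Release Fences (Sketch)}
\label{lst:Shared Acquire and Release Fences Sketch}
\end{listing}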
In many cases, other ordered atomic operations may be substituted for
the \co{atomic_thread_fence()} operations.
For example, \co{synchronize_rcu()}
(also known as \co{rcu_synchronize()} in recent C++ working papers) implies
the semantics of \co{atomic_thread_fence(memory_order_acq_rel)}.
In addition, thread creation and thread join provide the ordering
semantics of:
\begin{itemize}
\item \co{atomic_thread_fence(memory_order_release)} for thread creation.
\item \co{atomic_thread_fence(memory_order_acquire)} for the initialization of the
corresponding thread.
\item \co{atomic_thread_fence(memory_order_release)} for the termination of the
corresponding thread.
\item \co{atomic_thread_fence(memory_order_acquire)} for the join operation
of the corresponding thread.
\end{itemize}
It is of course more conventional to consider thread creation to
synchronize with the created thread and thread termination to synchronize
with the corresponding join operation.
This leads to the final example, which is that the shared-fences pattern
also applies to any pair of library calls where one synchronizes with
the other.
In this design pattern, OOTA and RFUB behaviors are ruled out by the semantics
of \co{atomic_thread_fence()} or by synchronizes-with, as the case may be.
This pattern can in theory be checked by verifying that all accesses
are associated with a fence; however, current checker technology likely
requires that the associated fence be explicitly called out.
\subsection{Atomic Reference-Count Updates}
\label{sec:Atomic Reference-Count Updates}
In certain reference-count use cases, the ordering of the increments and
decrements is irrelevant.
One common case is where it is only legal to increment the reference
count when the incrementing thread already holds a reference, in which
case the count cannot possibly decrease to zero in the meantime.
Because only the one-to-zero transition requires ordering, reference-count
increments can be relaxed in cases where another reference is guaranteed
to be held throughout.
Similarly, reference-count decrements can also be relaxed, but only if
the thread will still hold at least one reference after the decrement.
In other words, a thread releasing its last reference is forbidden
from using a relaxed operation to do so, because in that case there
is no guarantee that another reference is held
throughout.\footnote{
More elaborate variants of this pattern allow these rules to
be relaxed.
For example, if a parent thread is guaranteed not to release
its last reference until after joining with its child threads,
then those child threads may use relaxed decrements to release
their final reference.}
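As a minimal sketch of this pattern (production implementations often
instead split the final decrement into a release decrement followed by
an acquire fence on the zero path):
\begin{listing}[tbp]
\begin{verbatim}
 1 class RefCounted {
 2   atomic<long> refs;
 3 public:
 4   RefCounted() : refs(1) {}
 5   void inc() {  // Caller must already hold a reference,
 6                 // so the count cannot reach zero here.
 7     refs.fetch_add(1, memory_order_relaxed);
 8   }
 9   void dec() {  // Releasing the last reference needs ordering.
10     if (refs.fetch_sub(1, memory_order_acq_rel) == 1)
11       delete this;
12   }
13 };
\end{verbatim}
\caption{Relaxed Reference-Count Increment (Sketch)}
\label{lst:Relaxed Reference-Count Increment Sketch}
\end{listing}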
This pattern can be checked in conjunction with lock dependency checkers
such as the Linux kernel's lockdep facility~\cite{JonathanCorbet2006lockdep}.
We suspect that this is an example of a more general class of patterns,
but other examples of such a class do not immediately come to mind.
One can of course imagine things like preprocessed sensor values where
these values are irrelevant except in their relation to cutoff values.
We would welcome examples used in actual code.
\subsection{Untrusted Loads}
\label{sec:Untrusted Loads}
In many cases, it is acceptable for a load from an atomic shared variable
to occasionally return random bits because the value is checked by
some later operation.
In such cases, the load can be a relaxed load.
\begin{listing}[tbp]
\begin{verbatim}
1 unsigned long expected = x.load(memory_order_relaxed);
2 while (!x.compare_exchange_weak(expected, f(expected)))
3 continue;
\end{verbatim}
\caption{Untrusted Load Checked by CAS}
\label{lst:Untrusted Load Checked by CAS}
\end{listing}
Listing~\ref{lst:Untrusted Load Checked by CAS}
demonstrates this pattern.
If misordering, OOTA, or RFUB were to cause line~1 to return a bogus
value, then the \co{compare_exchange_weak()} on line~2 would fail,
implicitly re-loading the value and retrying.
\subsubsection{Pre-Load for Compare and Swap}
\label{sec:Pre-Load for Compare and Swap}
Perhaps the best-known later checking operation is a non-relaxed
compare-and-swap (CAS).
The \co{atomic_compare_exchange_*()} family of read-modify-write
CAS operations is typically used in a loop, and often requires an initial
load prior to the first pass through that loop.
For non-relaxed CAS operations, this initial load can typically be a
relaxed load, with the CAS operation's ordering preventing OOTA and RFUB
behaviors.
Relaxed CAS operations need to be part of some other design pattern
(for example, the shared fences pattern called out in
Section~\ref{sec:Shared Fences})
if cycles containing them are to be guaranteed to be OOTA/RFUB-free in
conjunction with an initial relaxed load.
One common design pattern is the single-location data structure discussed in
Section~\ref{sec:Single-Location Data Structures}.
Additional examples are presented by
Sinclair et al.~\cite{Sinclair:2017:CAR:3079856.3080206}.
This pattern can be checked by verifying that the values from the relaxed
loads propagate only to a CAS operation.
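Spelled out with explicit orderings, the loop from
Listing~\ref{lst:Untrusted Load Checked by CAS} becomes:
\begin{listing}[tbp]
\begin{verbatim}
1 unsigned long expected = x.load(memory_order_relaxed);
2 while (!x.compare_exchange_weak(expected, f(expected),
3                                 memory_order_acq_rel,
4                                 memory_order_relaxed))
5   continue;
\end{verbatim}
\caption{Relaxed Pre-Load for Non-Relaxed CAS (Sketch)}
\label{lst:Relaxed Pre-Load for Non-Relaxed CAS Sketch}
\end{listing}
Both the pre-load and the CAS failure path may be relaxed because the
successful \co{memory_order_acq_rel} CAS supplies the needed ordering.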
\subsubsection{Sequence Locking}
\label{sec:Sequence Locking}
The accesses within a sequence-locking read-side critical section
can use relaxed loads because any concurrency with the corresponding
update will result in a retry, thus discarding any loaded values.
Assuming that sequence-locking readers never store to shared memory,
this not only prevents the surfacing of any OOTA or RFUB cycles, but
also of any other non-SC behaviors.
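As a minimal sketch of such a reader (assuming hypothetical atomic
objects \co{seq}, \co{data1}, and \co{data2}, with \co{seq} odd while
an update is in progress):
\begin{listing}[tbp]
\begin{verbatim}
 1 unsigned long r1, r2;
 2 for (;;) {
 3   unsigned s = seq.load(memory_order_acquire);
 4   if (s & 1)
 5     continue;  // Update in progress: retry.
 6   r1 = data1.load(memory_order_relaxed);
 7   r2 = data2.load(memory_order_relaxed);
 8   atomic_thread_fence(memory_order_acquire);
 9   if (seq.load(memory_order_relaxed) == s)
10     break;     // No update raced with the reads.
11 }
\end{verbatim}
\caption{Sequence-Lock Reader (Sketch)}
\label{lst:Sequence-Lock Reader Sketch}
\end{listing}
Any concurrent update changes \co{seq}, forcing a retry that discards
the relaxed-loaded values.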
Note that a proposal\footnote{
\url{http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1478r0.html}}
provides an \co{atomic_load_per_byte_memcpy()} that allows safe
non-atomic access to data for sequence-lock readers, as well as an
\co{atomic_store_per_byte_memcpy()} to update that same data by
sequence-lock updaters.
It is nevertheless quite possible that some sequence-lock readers
might continue to use relaxed atomics in order to permit reliable
computations within readers in the presence of data objects having
trap representations.
Furthermore, sophisticated sequence-locking use cases may need to
use relaxed accesses for other reasons.
For example, the Linux kernel's lockless path-to-\co{inode} traversal uses
the closely related sequence counters to detect large-scale changes
to the filesystem tree that would otherwise confuse this
traversal~\cite{NeilBrown2015PathnameLookup,NeilBrown2015RCUwalk}.
Such confusion could result in a number of anomalies, including successful
lookup of paths that never actually existed.
This pattern can be checked by enlisting the aid of lock dependency
checkers to verify that the access is within the scope of a sequence
lock reader.
Checking that the value does not leak out of that sequence lock is
more difficult.
\subsection{Unidirectional Data Flow}
\label{sec:Unidirectional Data Flow}
If data flows only in one direction, then OOTA cycles cannot form.
The following sections give several examples of this general design
pattern.
\subsubsection{Independent Input Data}
\label{sec:Independent Input Data}
Input data consisting of independent objects may be read using relaxed
accesses because these objects are not affected by downstream computations.
Here input data is defined broadly, including:
\begin{enumerate}
\item Measurements of outside environmental conditions.
\item Device configuration data.
\item Software configuration data.
\item Security policies.
\item Network routing information.
\end{enumerate}
The key point is that the concurrent-computation portion of the application
references but does not modify this data, and that there are no
object-to-object consistency constraints.
\subsubsection{Independent Output Data}
\label{sec:Independent Output Data}
Similarly, output data consisting of independent objects may be written
using relaxed accesses because these objects do not affect upstream
computations.
As before, output data is defined broadly, including:
\begin{enumerate}
\item Control of objects external to the computer.
\item Many classes of debug output.
\item Some use cases involving video frame buffers.
\item Some use cases involving communication to a later stage of a
software pipeline.
\end{enumerate}
Similar to the independent input data discussed in the preceding section,
the key point is that the concurrent-computation portion of the application
modifies but does not reference this data, and that there are no
object-to-object consistency constraints.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> cs1(0);
2 atomic<int> cs2(0);
3
4 void thread1()
5 {
6 int s1 = get_ext_state(1);
7 int s2 = get_ext_state(2);
8 cs1.store(reduce_state(s1), memory_order_relaxed);
9 cs2.store(reduce_state(s2), memory_order_relaxed);
10 }
11
12 void thread2()
13 {
14 int c = compute_ctl(cs1.load(memory_order_relaxed),
15 cs2.load(memory_order_relaxed));
16 set_ext_ctl(c);
17 }
\end{verbatim}
\caption{Unidirectional I/O Data Flow}
\label{lst:Unidirectional I/O Data Flow}
\end{listing}
Listing~\ref{lst:Unidirectional I/O Data Flow}
combines item~2 from the lists in
Sections~\ref{sec:Independent Output Data}
and~\ref{sec:Independent Input Data}.
The \co{thread1()} input data flows from the \co{get_ext_state()} functions
through the \co{reduce_state()} functions into the \co{cs1} and \co{cs2}
shared variables.
The \co{thread2()} output data flows from these same \co{cs1} and \co{cs2}
shared variables through \co{compute_ctl()} and finally is output
by \co{set_ext_ctl()}.
The data flow is unidirectional from input to output, so no OOTA cycles
can form.
\subsubsection{Statistical Counters}
\label{sec:Statistical Counters}
The canonical instance of a unidirectional data-flow pattern is the
statistical counter, in which each thread (or CPU, as the case may be)
updates its own counter, and the aggregate value of the counter is read
out by summing all threads'
counters~\cite[Section 5.2]{McKenney2018ParallelProgramming-2018-12-08a}.
Statistical counters do have concurrent updates and reads, and thus must
use atomics.
However, the concurrent reads can be modeled as returning approximate
results (for example, for monitoring or debugging), and can in fact be
modeled as sequentially consistent approximate operations.
But more to the point, data flow in real use cases is always
unidirectional, proceeding from the updater responding to an event
and flowing through the counter to some reader displaying or logging
statistics.
This unidirectional data flow precludes the cycles required for OOTA or
RFUB behavior to manifest.
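For concreteness, a minimal sketch of such a counter (the
\co{MAX_THREADS} constant and \co{my_tid()} helper are hypothetical)
might look as follows:
\begin{listing}[tbp]
\begin{verbatim}
 1 template<typename T>
 2 class StatCounter {
 3   std::array<atomic<T>, MAX_THREADS> ctr{};
 4 public:
 5   void increase(T n) {  // Invoked only by thread my_tid().
 6     ctr[my_tid()].fetch_add(n, memory_order_relaxed);
 7   }
 8   T readout() {         // Approximate concurrent sum.
 9     T sum = 0;
10     for (auto &c : ctr)
11       sum += c.load(memory_order_relaxed);
12     return sum;
13   }
14 };
\end{verbatim}
\caption{Statistical Counter (Sketch)}
\label{lst:Statistical Counter Sketch}
\end{listing}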
\begin{listing}[tbp]
\begin{verbatim}
1 StatCounter<unsigned long> a;
2 StatCounter<unsigned long> b;
3
4 void thread1()
5 {
6 unsigned long r1 = a.readout();
7 b.increase(r1);
8 }
9
10 void thread2()
11 {
12 unsigned long r2 = b.readout();
13 a.increase(r2);
14 }
\end{verbatim}
\caption{Statistical-Counter Abuse and OOTA}
\label{lst:Statistical-Counter Abuse and OOTA}
\end{listing}
An example abuse is shown in
Listing~\ref{lst:Statistical-Counter Abuse and OOTA}.
Lines~1 and~2 define a pair of statistical counters \co{a} and \co{b}.
The \co{thread1()} and \co{thread2()} functions form a classic
data-dependent OOTA cycle.
Assuming both statistical counters start out with all counters zero,
we could in theory see the following OOTA sequence of events:
\begin{enumerate}
\item Line~6 sums \co{a}'s counters, obtaining the sum 42.
\item Line~7 increases the current component of \co{b}'s counter by 42.
\item Line~12 sums \co{b}'s counters, obtaining the sum 42 due to the
increase from line~7.
\item Line~13 increases the current component of \co{a}'s counter by 42,
thus justifying the sum of 42 obtained by line~6.
\end{enumerate}
Of course, the code in
Listing~\ref{lst:Statistical-Counter Abuse and OOTA}
is complete nonsense: Counters should count events, not each other's
cumulative values.
The code as written is about as useful as the proverbial screen door in
a submarine.
Problems of this sort should be located in a code review, or better yet
during the preceding design review.\footnote{
Yes, this could be considered analogous to a difference-equation
control system.
But in that case, the system being controlled is part of the
loop, and proper synchronization must be used when communicating
with that system.
In addition, the actual difference-equation computation will
normally be single-threaded.
More importantly, if the system being controlled might pose a threat
to life and limb, the design review had jolly well better be
sufficiently well-informed and thorough as to avoid this sort
of problem.}
Exact values are sometimes obtained from statistical counters in
stop-the-world situations, such as checking for consistent results
at the end of a stress test or benchmarking
run~\cite[Sections 5.3 and 5.4]{McKenney2018ParallelProgramming-2018-12-08a}.
Alternatively, counter updates might be carried out while read-holding
a given reader-writer lock and counter reads while write-holding
that same lock.
In all of these cases, OOTA and RFUB behaviors are additionally avoided due
to the fully synchronized nature of the readout.
\subsubsection{Software Pipelines}
\label{sec:Software Pipelines}
Software pipelines break computation up into stages that might proceed
concurrently.
If the interface between a consecutive pair of stages is simple enough,
relaxed accesses might be used for the corresponding communication of data.
Pipelines are not necessarily strictly linear; in fact, it can be quite
advantageous to have concurrent stages feeding into a single subsequent
stage via a reduction step.
If the output of the concurrent stages is sufficiently simple, the
reduction step might be a simple relaxed atomic fetch-and-op operation
to a single scalar object.
An example of a sufficiently simple output is event counts emanating from
concurrent stream processing feeding into later sequential logic.
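A minimal sketch of such a reduction step (the \co{Item} type and the
\co{process()} and \co{log()} functions are hypothetical):
\begin{listing}[tbp]
\begin{verbatim}
 1 atomic<unsigned long> events(0);  // Reduction target.
 2
 3 void stream_stage(Item *item)  // One of several concurrent
 4 {                              // pipeline stages.
 5   events.fetch_add(process(item), memory_order_relaxed);
 6 }
 7
 8 void report_stage()  // Sequential stage, running after the
 9 {                    // concurrent stages have been joined.
10   log(events.load(memory_order_relaxed));
11 }
\end{verbatim}
\caption{Pipeline Reduction Step (Sketch)}
\label{lst:Pipeline Reduction Step Sketch}
\end{listing}
The thread join orders the concurrent increments before the sequential
readout, so all accesses to \co{events} may be relaxed.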
Note that the independent input and output data patterns discussed in
Sections~\ref{sec:Independent Input Data} and~\ref{sec:Independent Output Data}
might be endpoints of a software pipeline.
\subsubsection{Owner Field for Re-Entrant Mutex}
\label{sec:Owner Field for Re-Entrant Mutex}
This pattern is first analyzed for full C++, and then for strict C++.
Spoiler warning: There is reason to believe that this pattern works
in full C++, but not in strict C++.
\begin{listing}[tbp]
\begin{verbatim}
1 class my_reentrant_mutex {
2 std::mutex m;
3 // Writes of owner are protected by m.
4 // Only owner writes or clears its id.
5 std::atomic<std::thread::id> owner; // id() if not held
6 // Protected by m.
7 int count; // Lock reacquired count times by owner
8 . . .
9 }
10
11 void my_reentrant_mutex::lock() {
12 std::thread::id me = std::this_thread::get_id();
13 // No other thread can change whether owner == me.
14 if (owner.load(memory_order_relaxed) == me) {
15 ++count; // Done; reacquired the lock.
16 } else {
17 ... // Acquire m, leaving count == 0
18 owner.store(me, memory_order_relaxed);
19 }
20 }
\end{verbatim}
\caption{Re-Entrant Mutex Owner Field}
\label{lst:Re-Entrant Mutex Owner Field}
\end{listing}
Pseudo-code for a re-entrant exclusive mutex is shown in
Listing~\ref{lst:Re-Entrant Mutex Owner Field}.
Each mutex must track its owner (\co{owner} on line~5) in order to avoid
self-deadlock when the owner re-acquires a mutex that it already holds.
This owner field is updated only while the mutex is held (line~18), and
its value is used only to compare for equality to the current thread's ID.
Before releasing the mutex, the owner writes a special ID to the owner
field that is guaranteed not to match the ID of any thread.
Other threads can access the owner concurrently with the owner's
update, so the owner field must be atomic in order to avoid data races.
Of course, a nesting counter (\co{count} on line~7) must also be used
in order to identify the outermost lock-release operation; however, this
counter is accessed only by the thread currently holding the lock
(line~15).
Therefore, if the lock works correctly, exclusive access will be provided
to the nesting counter, as is required.
Those wishing to produce a proof of correctness are encouraged to try
induction.
Informally, the only time that the owner field can be equal to the thread ID
is when that thread carried out the last update to the owner field and
still holds the mutex:
\begin{enumerate}
\item Each thread writes only its ID or the special ID.
\item Because \co{memory_order_relaxed} loads are single-variable
SC, and because each thread sets the owner field to the special
ID before releasing the mutex, a given thread cannot see its own
ID unless it still holds the mutex.
\item Because atomic accesses forbid load tearing, each load from
the owner field will return either the special ID or the thread
ID corresponding to some thread that recently held the mutex.
\item Therefore, when a thread is not holding the mutex, it is guaranteed
not to load its own ID from the owner field.
\end{enumerate}
No other thread is allowed to write to the owner field while the mutex
is held, so it is impossible to form the cycles required for OOTA or
RFUB behavior to manifest.
Therefore, both the reads from and the writes to the owner field
may use \co{memory_order_relaxed}.
This is a special case of unidirectional data flow, with the data flowing
from the mutex holder to threads not holding the mutex.
The mutual exclusion provided by the mutex prevents any OOTA or RFUB
cycles from forming.
% More generally, if an object is written only while a given mutex is
% held, all accesses to that object may be relaxed without the possibility
% of OOTA or RFUB behaviors.
% --- If all writes to all objects are protected by a given mutex, maybe.
However, things might well be more difficult in strict C++.
These potential difficulties stem from the possibility of undefined
behavior (UB) back-propagating through a cycle so as to justify the OOTA
behavior~\cite{DavidGoldblatt2019NoElegantOOTAfix}.
To see the rationale for this back-propagating self-justifying UB (BPSJUB?),
consider a pair of threads each concurrently attempting to acquire a
mutex, but where (incorrectly and inconsistently) each sees that the
owner field matches their respective thread IDs.
Both threads would then simultaneously execute within their respective
critical sections, which could result in UB.
UB can back-propagate in time, which could quite possibly result
in the threads each seeing their own values in the owner field, which
is what instigated the UB in the first place.\footnote{
Full disclosure: Paul wrote this paragraph, and he is not
completely sold on back-propagating self-justifying UB. The
critical reader should therefore review both David Goldblatt's
working paper~\cite{DavidGoldblatt2019NoElegantOOTAfix} and Hans
Boehm's recent proposal~\cite{HansBoehm2020ConcurrentUB}.}
% @@@ Hans: How can memory_order_load_store help? If all of the
% accesses in the critical section are non-atomic, then
% there is nothing for that memory_order_load_store access
% to order with! Yes, the the compiler is required to
% avoid introducing data races involving those non-atomics,
% but all that the compiler is required to do in that case
% is to avoid hoisting those accesses above the owner-field
% load. What am I missing here?
Therefore, developers and reviewers should assume that owner fields for
re-entrant mutexes require full C++ in order to operate correctly.
\subsubsection{One-Way Memory Allocation}
\label{sec:One-Way Memory Allocation}
One-way memory allocation provides fresh memory that is never deallocated,
or that is deallocated using a heavyweight one-sided mechanism, for
example, a stop-the-world deallocation phase.
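A minimal sketch of such an allocator (the \co{ARENA_SIZE} constant is
hypothetical, and alignment is ignored for brevity):
\begin{listing}[tbp]
\begin{verbatim}
 1 char arena[ARENA_SIZE];  // Backing store, never freed.
 2 atomic<size_t> next(0);  // Allocation offset.
 3
 4 void *one_way_alloc(size_t size)
 5 {
 6   size_t offset = next.fetch_add(size, memory_order_relaxed);
 7   if (offset + size > ARENA_SIZE)
 8     return nullptr;       // Arena exhausted.
 9   return &arena[offset];  // Fresh, thread-private memory.
10 }
\end{verbatim}
\caption{One-Way Memory Allocation (Sketch)}
\label{lst:One-Way Memory Allocation Sketch}
\end{listing}
The relaxed \co{fetch_add()} suffices because each allocation returns
disjoint, never-before-written bytes; publication of pointers to the
allocated memory of course requires separate synchronization.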