\documentclass[10pt]{article}
% standard packages
% A more pleasant font
\usepackage[T1]{fontenc} % use postscript type 1 fonts
\usepackage{textcomp} % use symbols in TS1 encoding
\usepackage{mathptmx,helvet,courier} % use nice, standard fonts for roman, sans and monospace respectively
% Improves the text layout
\usepackage{microtype}
\usepackage{lscape}
\usepackage{fancyhdr}
\usepackage{epsfig}
\usepackage{subfigure}
\usepackage{url}
\usepackage{graphics}
\usepackage{enumerate}
\usepackage{ifthen}
\usepackage{float}
\usepackage{listings}
\lstset{basicstyle=\ttfamily}
% \usepackage[strings]{underscore}
% \usepackage{underscore}
\usepackage[bookmarks=true,bookmarksnumbered=true,pdfborder={0 0 0}]{hyperref}
\lstset{
literate={\_}{}{0\discretionary{\_}{}{\_}}%
}
\usepackage[table]{xcolor}
\usepackage{booktabs}
\DeclareUrlCommand\email{}
\pagestyle{fancy}
\rhead{}
\newfloat{listing}{tbp}{lol}
\floatname{listing}{Listing}
\begin{document}
\title{P2055R1: A Relaxed Guide to \co{memory_order_relaxed}}
\newcommand{\co}[1]{\lstinline[breaklines=yes,breakatwhitespace=yes]{#1}}
\author{
Hans Boehm\\\email{[email protected]} \and
David Goldblatt\\\email{[email protected]} \and
Paul E.~McKenney\\\email{[email protected]} \and
The Indefatigable TBD
}
\date{April 8, 2024}
\maketitle{}
Audience: SG1 (Informational)
\begin{abstract}
The out-of-thin-air (OOTA) and read-from-untaken-branch (RFUB)
properties of the specification of \co{memory_order_relaxed}
have resulted in considerable consternation over the years.
Although there are no known instances of full-blown OOTA
behavior, and no known RFUB-induced failures of production code,
the theoretical possibility of these properties severely
complicates automated analysis of large C and C++ code bases.
Thus far, attempts to eliminate OOTA and RFUB properties from
the memory model have resulted in otherwise needless added
overheads on weakly ordered systems on the one hand or
excessive implementation complexity on the other.
However, \co{memory_order_relaxed} never was intended to be used
in arbitrary code, but rather as a part of deliberate application
of specific concurrency designs.
This paper forms an initial catalog of patterns underlying such
designs, with an eye towards identifying bugs in existing code
and preventing bugs in new code.
\end{abstract}
\section{Background}
\label{sec:Background}
\subsection{Memory Ordering Strength and Expense}
\label{sec:Memory Ordering Strength and Expense}
C++ atomic operations are sequentially consistent by default.
In programs that use only sequentially consistent atomic operations,
these atomics enjoy mathematically appealing interleaving semantics.
However, no computer system in common use in the year 2023 provides
sequential consistency at the hardware level, not even relatively
strongly ordered systems such as x86.
This means that sequentially consistent atomics can be quite expensive,
especially on weakly ordered systems such as ARM and IBM Power.
Furthermore, a great many use cases in real-world software do not
require sequentially consistent atomics; in fact, the much lower-overhead
release/acquire ordering suffices in a great many situations.
For this reason, C++ provides \co{memory_order} \co{enum} values to
allow the user to control the strength of individual atomic operations.
The \co{memory_order_release} and \co{memory_order_acquire} values
provide the aforementioned release-acquire ordering.
But most relevant to this document, \co{memory_order_relaxed} allows
arbitrary reordering of accesses to distinct locations.
With the exception of the unlamented Itanium CPU family, aligned and
machine-word-sized \co{memory_order_relaxed} loads and stores compile
to unadorned load and store instructions, providing complete control,
excellent efficiency, and stunning scalability.
And in practice, \co{memory_order_relaxed} accesses provide well-understood
ordering properties, which has led to them being heavily used.
Unfortunately, in theory \co{memory_order_relaxed} accesses are subject
to OOTA.
RFUB can occur very rarely in practice, and there is some debate as to
whether this is acceptable.
These issues are discussed in the following section.
\subsection{OOTA and RFUB}
\label{sec:OOTA and RFUB}
There has been considerable work done on OOTA and RFUB over many years,
building on prior work in the Java community, perhaps best exemplified
by the infamous ``Causality Test Cases''.\footnote{
\url{http://www.cs.umd.edu/~pugh/java/memoryModel/unifiedProposal/testcases.html}.}
There has long been hope that additional research effort will identify
a model of OOTA that all can live with, for example, on the part of Paul,
and that everyone would come to appreciate the relative simplicity of
strengthening \co{memory_order_relaxed} to forbid prior reads to be
reordered with later writes, for example, on the part of
Hans~\cite{Boehm:2014:OGA:2618128.2618134,HansBoehm2019OOTArevisitedAgain,Lahav:2017:RSC:3062341.3062352}.
And progress has been made on both fronts.
On the additional-research front, we now have methods of distinguishing
between OOTA on the one hand and simple reordering on the other.
Unfortunately, one method requires per-scenario
creativity~\cite{PaulEMcKenney2016OOTA},
while others have not as yet been looked upon with favor by compiler
implementers~\cite{Lahav:2017:RSC:3062341.3062352,Sinclair:2017:CAR:3079856.3080206,Lee:10.1145/3385412.3386010,MarkBatty2019ModularRelaxedDependenciesOOTA}.
Backwards-propagating undefined behavior can be especially
troublesome~\cite{DavidGoldblatt2019NoElegantOOTAfix},
but there are proposals that this be
restricted~\cite{HansBoehm2020UBalternatives}.
Perhaps such restrictions might eliminate some of the more troublesome
examples of OOTA.
On the reads-before-writes front, there are some indications that newer
weakly ordered hardware incurs reduced penalties for ordering relaxed
reads before relaxed writes, at least for low-end and high-end systems.
However, middle-end systems still incur significant penalties.
This has led to renewed suggestions that a new \co{memory_order_load_store}
member be added to the \co{memory_order} \co{enum}.
It has also led to a renewed proposal that \co{memory_order_relaxed}
be strengthened so as to prohibit reordering of prior loads and subsequent
stores~\cite{LukeGeeson2023TightenRelaxed};
however, this proposal was not universally
loved~\cite{RichardGrisenthwaite2023TightenRelaxedJustSayNo}.
A further complication is that although there is general agreement that
OOTA behaviors must be forbidden, there is some debate on the need to
forbid RFUB behaviors.
Some of those who believe that RFUB behaviors should be allowed
(for example, Paul) argue that the scenarios where RFUB can occur
are contrived and with no known production use cases.
There is general agreement that simple reordering should be allowed, as
it occurs in practice in scenarios that are both useful and commonplace.
\subsection{OOTA Examples}
\label{sec:OOTA Examples}
This section looks at examples of simple reordering, OOTA, and RFUB.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> x(0);
2 atomic<int> y(0);
3
4 void thread1()
5 {
6 unsigned long r1 = x.load(memory_order_relaxed);
7 y.store(r1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 unsigned long r2 = y.load(memory_order_relaxed);
13 x.store(42, memory_order_relaxed);
14 }
\end{verbatim}
\caption{Simple Reordering}
\label{lst:Simple Reordering}
\end{listing}
Listing~\ref{lst:Simple Reordering}
shows an example of simple reordering.
Both the compiler and the CPU are within their rights to reorder
lines~12 and~13, in which case the following sequence of events
will result in all of \co{x}, \co{y}, \co{r1}, and \co{r2} having
the value 42:
\begin{enumerate}
\item Line~13 stores 42 to \co{x}.
\item Line~6 loads 42 from \co{x} into \co{r1}.
\item Line~7 stores \co{r1}, and thus 42, to \co{y}.
\item Line~12 loads 42 from \co{y} to \co{r2}.
\end{enumerate}
It is strongly and generally agreed that this simple reordering should
be allowed.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> x(0);
2 atomic<int> y(0);
3
4 void thread1()
5 {
6 unsigned long r1 = x.load(memory_order_relaxed);
7 y.store(r1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 unsigned long r2 = y.load(memory_order_relaxed);
13 x.store(r2, memory_order_relaxed);
14 }
\end{verbatim}
\caption{Simple OOTA}
\label{lst:Simple OOTA}
\end{listing}
Listing~\ref{lst:Simple OOTA} shows an OOTA example in which all of
\co{x}, \co{y}, \co{r1}, and \co{r2} have final values of 42, an
outcome that should be prohibited.
In actual hardware, this prohibition is enforced by TSO ordering in
strongly ordered systems and by data dependency ordering in weakly
ordered systems.
Compilers cannot store something until after they load it, which
results in instructions being emitted such that the hardware enforcement
applies.
The need to prohibit simple OOTA is one reason why compiler-based
value speculation optimizations are frowned upon.
% @@@ Hans, I omitted your array-based setup, as this involves an
% out-of-bounds array reference, which is itself UB.
% If this example is needed, please help me understand the point that
% it is making.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> x(0);
2 atomic<int> y(0);
3
4 void thread1()
5 {
6 unsigned long r1 = x.load(memory_order_relaxed);
7 y.store(r1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 bool assigned_42 = false;
13 unsigned long r2 = y.load(memory_order_relaxed);
14 if (r2 != 42) {
15 assigned_42 = true;
16 r2 = 42;
17 }
18 x.store(r2, memory_order_relaxed);
19 }
\end{verbatim}
\caption{Simple RFUB}
\label{lst:Simple RFUB}
\end{listing}
Listing~\ref{lst:Simple RFUB}
shows a simple example of RFUB, which again means ``read from untaken
branch''.
The following sequence of events will result in all of \co{x}, \co{y},
\co{r1}, and \co{r2} having the value 42:
\begin{enumerate}
\item The compiler notices that at line~17, the value of \co{r2}
is always 42.
\item The compiler thus substitutes the value 42 for \co{r2} in
line~18.
\item Both the compiler and the CPU are within their rights to reorder
lines~13 and~18.
\item Line~18 stores 42 to \co{x}.
\item Line~6 loads 42 from \co{x} into \co{r1}.
\item Line~7 stores \co{r1}, and thus 42, to \co{y}.
\item Line~13 loads 42 from \co{y} to \co{r2}.
\item Because \co{r2} is equal to 42, lines~15 and~16 are never
executed, even though it was exactly these lines of code
that justified the above compiler optimizations.
\end{enumerate}
Many concurrency experts believe that RFUB should be allowed.
\subsection{Classification of Patterns}
\label{sec:Classification of Patterns}
Perhaps agreement on all of these points will be reached, but in the
meantime, \co{memory_order_relaxed} use is increasing, and with it an
increasing need to identify known-safe usage patterns.
In the best case, these usage patterns might be automatically checked
in existing code, but at a minimum we hope that this list will be
useful to code reviewers.
Either way, the goals are to identify bugs in existing code and to
help avoid bugs in new code.
This paper is a first step toward such a set of patterns.
The term \emph{full C++} refers to the C++20 memory model as stated in
the current draft.
The term \emph{strict C++} refers to the subset of full C++ obtained
by dropping the following normative encouragement from the C++20 memory
model:
``Implementations should ensure that no `out-of-thin-air' values are
computed that circularly depend on their own computation.''
Some (but not all) of the proto-patterns in this document are safe in
strict C++, but all of them are safe in full C++.
\section{Relaxed Design Patterns}
\label{sec:Relaxed Design Patterns}
Many of these patterns are taken from Hans's \co{memory-model-design}
posting on September 4, 2018.\footnote{
Message-ID: \co{<CAMOCf+jchGw6DeE2NyCJA3wfFbNH-WFn59JruZPSWt9_jPW9NQ@mail.gmail.com>}.}
In the examples, \co{x} and \co{y} denote potentially shared locations,
while \co{r1} and \co{r2} denote local variables (``registers'') whose
addresses are not taken.
\subsection{Non-Racing Accesses}
\label{sec:Non-Racing Accesses}
Any non-racing access to an atomic object can be a relaxed access.
Because the access is not concurrent with a conflicting access (store
against either store or load), further ordering is unnecessary.\footnote{
This covers case~\#8 in Hans's September 4, 2018 email.}
In fact, such accesses can in theory be non-atomic.
In environments where atomicity is controlled by the access rather
than the object definition, such accesses are often non-atomic in
practice~\cite{JadeAlglave2019WhoAfraidCompiler}.
\begin{listing}[tbp]
\begin{verbatim}
1 int x = 0;
2 int y = 0;
3
4 void thread1()
5 {
6 if (x)
7 y = 1;
8 }
9
10 void thread2()
11 {
12 if (y)
13 x = 1;
14 }
\end{verbatim}
\caption{Non-Atomic Accesses Sometimes Respect Control Dependencies}
\label{lst:Non-Atomic Accesses Sometimes Respect Control Dependencies}
\end{listing}
For example, given concurrent execution of \co{thread1()} and
\co{thread2()} in
Listing~\ref{lst:Non-Atomic Accesses Sometimes Respect Control Dependencies},
the only permitted outcome results in both \co{x} and \co{y} being
equal to zero in both full C++ and strict C++.
Any other outcome would violate the ``sequential consistency for
data race free programs'' principle, and must effectively be due to a
compiler-created data race, which is forbidden.
\begin{listing}[tbp]
\begin{verbatim}
1 std::atomic<int> x = 0;
2 std::atomic<int> y = 0;
3
4 void thread1()
5 {
6 if (x.load(memory_order_relaxed))
7 y.store(1, memory_order_relaxed);
8 }
9
10 void thread2()
11 {
12 if (y.load(memory_order_relaxed))
13 x.store(1, memory_order_relaxed);
14 }
\end{verbatim}
\caption{Strict C++ Does Not Require Atomics to Respect Control Dependencies}
\label{lst:Strict C++ Does Not Require Atomics to Respect Control Dependencies}
\end{listing}
In contrast, in the analogous program using C++ atomics
(see Listing~\ref{lst:Strict C++ Does Not Require Atomics to Respect Control Dependencies}),
additional behaviors are permitted by strict C++,
including the one resulting in the final values of both \co{x}
and \co{y} being \co{1}.
The restriction to ``strict C++'' is important because this code fragment
is considered to be an example of the OOTA behavior that is forbidden
by the normative encouragement in that same standard.
In short, although any non-racing access to an atomic object may be
relaxed, strict C++ counter-intuitively classifies many access patterns
as racy.
\begin{listing}[tbp]
\begin{verbatim}
1 if (!x_init.load(memory_order_acquire)) {
2 lock_guard<mutex> _(x_init_mtx);
3 if (!x_init.load(memory_order_relaxed)) {
4 initialize(&x);
5 x_init.store(true, memory_order_release);
6 }
7 }
\end{verbatim}
\caption{Double-Checked Locking}
\label{lst:Double-Checked Locking}
\end{listing}
Relaxed atomics can nevertheless be useful for non-racing accesses
in real-life situations, as can be seen in the double-checked
locking example shown in
Listing~\ref{lst:Double-Checked Locking}.
Line~1 uses an acquire load from \co{x_init} to check whether
initialization is needed, in which case line~2 acquires \co{x_init_mtx}
using an anonymous \co{lock_guard} (anonymity being designated by the
underscore where a local-variable name would be expected).
Once this lock is held, line~3 uses a relaxed load to recheck \co{x_init}
to see whether initialization is still needed.
A relaxed load works here because holding the lock prevents other
threads from storing to \co{x_init}.
Line~4 carries out the initialization, and line~5 updates \co{x_init}
to indicate that initialization is complete.
When the acquire load on line~1 returns the value \co{true}, that
load synchronizes with the release store on line~5, guaranteeing
that any code following line~7 sees the results of that initialization.
These patterns can be at least partially checked using data-race
detectors, both static and runtime.
\subsubsection{Initialization and Cleanup}
\label{sec:Initialization and Cleanup}
Important special cases of this pattern are the single-threaded
initialization and cleanup phases of an otherwise concurrent program.
These use cases are one motivation for the strong ordering guarantees
of thread creation and destruction.
These guarantees permit the single-threaded initialization and cleanup
code to run race free, with no need to consider interference from the
intervening code that runs multithreaded.
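As a minimal sketch of this special case, the following listing confines
plain (non-atomic) accesses to the initialization and cleanup phases,
with thread creation and join providing all needed ordering (the
\co{compute()} function is a hypothetical stand-in for the
multithreaded phase):
\begin{listing}[tbp]
\begin{verbatim}
 1 int cfg;  // Plain non-atomic object.
 2
 3 void worker()
 4 {
 5   compute(cfg);  // Thread creation ordered the
 6 }                // initialization of cfg before this read.
 7
 8 int main()
 9 {
10   cfg = 42;               // Single-threaded initialization.
11   std::thread t(worker);  // Creation synchronizes with worker().
12   t.join();               // worker() synchronizes with join().
13   cfg = 0;                // Single-threaded cleanup, race free.
14 }
\end{verbatim}
\caption{Single-Threaded Initialization and Cleanup (Sketch)}
\label{lst:Single-Threaded Initialization and Cleanup Sketch}
\end{listing}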
\subsubsection{Lock-Based Critical Sections}
\label{sec:Lock-Based Critical Sections}
Exclusive locks provide mutual exclusion, so that objects accessed
only while holding a given lock may be accessed using
\co{memory_order_relaxed} accesses, or, for that matter, using
non-atomic accesses.
Reader-writer locks provide a weaker form of mutual exclusion.
However, objects that are updated only while a given reader-writer
lock is write-held and read only when that same lock is either
read-held or write-held may also be accessed using
\co{memory_order_relaxed} accesses, or, again, using non-atomic accesses.
Of course, non-atomic accesses are almost always used with pure locking.
However, \co{memory_order_relaxed} accesses are sometimes quite useful,
for example, in cases where objects pass through a software pipeline,
where one stage uses pure locking and another stage relies on atomic
operations.
% @@@ Hans: Similar patterns exist for RCU and for hazard pointers.
% I am definitely not adding these this close to the deadline,
% but perhaps the next version.
\subsection{Single-Location Data Structures}
\label{sec:Single-Location Data Structures}
Relaxed atomic operations provide sequentially consistent access to
a single object.
This means that data structures that fit into a single object can
be accessed with relaxed atomics with no possibility of OOTA or
RFUB behavior.
Note well that a group of single-location data structures might well
interact in a way that could raise the spectre of OOTA or RFUB.
As before, design review should therefore pay careful attention to
information flow.
These patterns can be checked by verifying that no store to another shared
variable is affected by the value of the single-location data structure,
unless that value can be shown not to affect that same single-location
data structure, for example, if that other shared variable is part of
a unidirectional data flow
(see Section~\ref{sec:Unidirectional Data Flow}).
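As a minimal sketch of this pattern (assuming that the packed state has
no consistency constraints with other shared objects), the following
listing packs a stop bit and a count into a single atomic word and
manipulates it entirely with relaxed operations:
\begin{listing}[tbp]
\begin{verbatim}
 1 atomic<uint64_t> state(0);  // Bit 0: stop bit; bits 1-63: count.
 2
 3 bool try_increment()
 4 {
 5   uint64_t s = state.load(memory_order_relaxed);
 6   do {
 7     if (s & 1)
 8       return false;  // Stopped: no more increments.
 9   } while (!state.compare_exchange_weak(s, s + 2,
10                                         memory_order_relaxed));
11   return true;
12 }
\end{verbatim}
\caption{Single-Location Data Structure (Sketch)}
\label{lst:Single-Location Data Structure Sketch}
\end{listing}
Because every access is to the single location \co{state},
per-location sequential consistency applies, and this object in
isolation cannot participate in an OOTA or RFUB cycle.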
\subsection{Shared Fences}
\label{sec:Shared Fences}
The \co{atomic_thread_fence()} function can be used to order multiple
accesses.
For example, consider a series of acquire loads that are intended to provide
order against subsequent accesses, but not against each other.
Because the compiler will normally not be able to determine that the
acquire loads need not be ordered against each other, on some platforms
this will result in a memory-fence instruction being emitted after each
and every acquire load, when only the last fence is required.
These unnecessary fences can be avoided by replacing the acquire
loads with relaxed loads followed by a single
\co{atomic_thread_fence(memory_order_acquire)}~\cite[Section 4.1]{RaulSilvera2007WeakMemoryModel}.
This will have the desired effect of ordering all of these loads with any
subsequent accesses, while also avoiding the overhead of the redundant
fence instructions that would be expected from the acquire loads.
Similarly, consider a series of release stores that are intended to provide
order against prior accesses, but not against each other.
Again, the compiler might emit a memory-fence instruction before
each of the stores, when only the first fence is required.
These unnecessary fences can be avoided by replacing the release
stores with a single \co{atomic_thread_fence(memory_order_release)}
followed by a series of relaxed
stores~\cite[Section 4.2]{RaulSilvera2007WeakMemoryModel}.
This will have the desired effect of ordering all of these stores with any
prior accesses, while also avoiding the overhead of any redundant
fence instructions emitted for the release stores.
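As a minimal sketch of both transformations (\co{x1}, \co{x2}, \co{y1},
and \co{y2} are assumed to be atomic objects):
\begin{listing}[tbp]
\begin{verbatim}
 1 // Acquire side: relaxed loads, then one acquire fence.
 2 unsigned long r1 = x1.load(memory_order_relaxed);
 3 unsigned long r2 = x2.load(memory_order_relaxed);
 4 atomic_thread_fence(memory_order_acquire);
 5 // Subsequent accesses are ordered after both loads.
 6
 7 // Release side: one release fence, then relaxed stores.
 8 // Prior accesses are ordered before both stores.
 9 atomic_thread_fence(memory_order_release);
10 y1.store(1, memory_order_relaxed);
11 y2.store(2, memory_order_relaxed);
\end{verbatim}
\caption{Shared Acquire and Release Fences (Sketch)}
\label{lst:Shared Acquire and Release Fences Sketch}
\end{listing}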
In many cases, other ordered atomic operations may be substituted for
the \co{atomic_thread_fence()} operations.
For example, \co{synchronize_rcu()}
(also known as \co{rcu_synchronize()} in recent C++ working papers) implies
the semantics of \co{atomic_thread_fence(memory_order_acq_rel)}.
In addition, thread creation and thread join provide the ordering
semantics of:
\begin{itemize}
\item \co{atomic_thread_fence(memory_order_release)} for thread creation.
\item \co{atomic_thread_fence(memory_order_acquire)} for the initialization of the
corresponding thread.
\item \co{atomic_thread_fence(memory_order_release)} for the termination of the
corresponding thread.
\item \co{atomic_thread_fence(memory_order_acquire)} for the join operation
of the corresponding thread.
\end{itemize}
It is of course more conventional to consider thread creation to
synchronize with the created thread and thread termination to synchronize
with the corresponding join operation.
This leads to the final example, which is that the shared-fences pattern
also applies to any pair of library calls where one synchronizes with
the other.
In this design pattern, OOTA and RFUB behaviors are ruled out by the semantics
of \co{atomic_thread_fence()} or by synchronizes-with, as the case may be.
This pattern can in theory be checked by verifying that all accesses
are associated with a fence; however, current checker technology likely
requires that the associated fence be explicitly called out.
\subsection{Atomic Reference-Count Updates}
\label{sec:Atomic Reference-Count Updates}
In certain reference-count use cases, the ordering of the increments and
decrements is irrelevant.
One common case is where it is only legal to increment the reference
count when the incrementing thread already holds a reference, in which
case the count cannot possibly decrease to zero in the meantime.
Because only the one-to-zero transition requires ordering, reference-count
increments can be relaxed in cases where another reference is guaranteed
to be held throughout.
Similarly, reference-count decrements can also be relaxed, but only if
the thread will still hold at least one reference after the decrement.
In other words, a thread releasing its last reference is forbidden
from using a relaxed operation to do so, because in that case there
is no guarantee that another reference is held
throughout.\footnote{
More elaborate variants of this pattern allow these rules to
be relaxed.
For example, if a parent thread is guaranteed not to release
its last reference until after joining with its child threads,
then those child threads may use relaxed decrements to release
their final reference.}
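As a minimal sketch of this pattern (production implementations often
instead split the final decrement into a release decrement followed by
an acquire fence on the zero path):
\begin{listing}[tbp]
\begin{verbatim}
 1 class RefCounted {
 2   atomic<long> refs;
 3 public:
 4   RefCounted() : refs(1) {}
 5   void inc() {  // Caller must already hold a reference,
 6                 // so the count cannot reach zero here.
 7     refs.fetch_add(1, memory_order_relaxed);
 8   }
 9   void dec() {  // Releasing the last reference needs ordering.
10     if (refs.fetch_sub(1, memory_order_acq_rel) == 1)
11       delete this;
12   }
13 };
\end{verbatim}
\caption{Relaxed Reference-Count Increment (Sketch)}
\label{lst:Relaxed Reference-Count Increment Sketch}
\end{listing}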
This pattern can be checked in conjunction with lock dependency checkers
such as the Linux kernel's lockdep facility~\cite{JonathanCorbet2006lockdep}.
We suspect that this is an example of a more general class of patterns,
but other examples of such a class do not immediately come to mind.
One can of course imagine things like preprocessed sensor values where
these values are irrelevant except in their relation to cutoff values.
We would welcome examples used in actual code.
\subsection{Untrusted Loads}
\label{sec:Untrusted Loads}
In many cases, it is acceptable for a load from an atomic shared variable
to occasionally return random bits because the value is checked by
some later operation.
In such cases, the load can be a relaxed load.
\begin{listing}[tbp]
\begin{verbatim}
1 unsigned long expected = x.load(memory_order_relaxed);
2 while (!x.compare_exchange_weak(expected, f(expected)))
3 continue;
\end{verbatim}
\caption{Untrusted Load Checked by CAS}
\label{lst:Untrusted Load Checked by CAS}
\end{listing}
Listing~\ref{lst:Untrusted Load Checked by CAS}
demonstrates this pattern.
If misordering, OOTA, or RFUB were to cause line~1 to return a bogus
value, then the \co{compare_exchange_weak()} on line~2 would fail,
implicitly re-loading the value and retrying.
\subsubsection{Pre-Load for Compare and Swap}
\label{sec:Pre-Load for Compare and Swap}
Perhaps the best-known later checking operation is a non-relaxed
compare-and-swap (CAS).
The \co{atomic_compare_exchange_*()} family of read-modify-write
CAS operations is typically used in a loop, and often requires an initial
load prior to the first pass through that loop.
For non-relaxed CAS operations, this initial load can typically be a
relaxed load, with the CAS operation's ordering preventing OOTA and RFUB
behaviors.
Relaxed CAS operations need to be part of some other design pattern
(for example, the shared fences pattern called out in
Section~\ref{sec:Shared Fences})
if cycles containing them are to be guaranteed to be OOTA/RFUB-free in
conjunction with an initial relaxed load.
One common design pattern is the single-location data structure discussed in
Section~\ref{sec:Single-Location Data Structures}.
Additional examples are presented by
Sinclair et al.~\cite{Sinclair:2017:CAR:3079856.3080206}.
This pattern can be checked by verifying that the values from the relaxed
loads propagate only to a CAS operation.
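Spelled out with explicit orderings, the loop from
Listing~\ref{lst:Untrusted Load Checked by CAS} becomes:
\begin{listing}[tbp]
\begin{verbatim}
1 unsigned long expected = x.load(memory_order_relaxed);
2 while (!x.compare_exchange_weak(expected, f(expected),
3                                 memory_order_acq_rel,
4                                 memory_order_relaxed))
5   continue;
\end{verbatim}
\caption{Relaxed Pre-Load for Non-Relaxed CAS (Sketch)}
\label{lst:Relaxed Pre-Load for Non-Relaxed CAS Sketch}
\end{listing}
Both the pre-load and the CAS failure path may be relaxed because the
successful \co{memory_order_acq_rel} CAS supplies the needed ordering.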
\subsubsection{Sequence Locking}
\label{sec:Sequence Locking}
The accesses within a sequence-locking read-side critical section
can use relaxed loads because any concurrency with the corresponding
update will result in a retry, thus discarding any loaded values.
Assuming that sequence-locking readers never store to shared memory,
this not only prevents the surfacing of any OOTA or RFUB cycles, but
also of any other non-SC behaviors.
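As a minimal sketch of such a reader (assuming hypothetical atomic
objects \co{seq}, \co{data1}, and \co{data2}, with \co{seq} odd while
an update is in progress):
\begin{listing}[tbp]
\begin{verbatim}
 1 unsigned long r1, r2;
 2 for (;;) {
 3   unsigned s = seq.load(memory_order_acquire);
 4   if (s & 1)
 5     continue;  // Update in progress: retry.
 6   r1 = data1.load(memory_order_relaxed);
 7   r2 = data2.load(memory_order_relaxed);
 8   atomic_thread_fence(memory_order_acquire);
 9   if (seq.load(memory_order_relaxed) == s)
10     break;     // No update raced with the reads.
11 }
\end{verbatim}
\caption{Sequence-Lock Reader (Sketch)}
\label{lst:Sequence-Lock Reader Sketch}
\end{listing}
Any concurrent update changes \co{seq}, forcing a retry that discards
the relaxed-loaded values.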
Note that a proposal\footnote{
\url{http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1478r0.html}}
provides an \co{atomic_load_per_byte_memcpy()} that allows safe
non-atomic access to data for sequence-lock readers, as well as an
\co{atomic_store_per_byte_memcpy()} to update that same data by
sequence-lock updaters.
It is nevertheless quite possible that some sequence-lock readers
might continue to use relaxed atomics in order to permit reliable
computations within readers in the presence of data objects having
trap representations.
Furthermore, sophisticated sequence-locking use cases may need to
use relaxed accesses for other reasons.
For example, the Linux kernel's lockless path-to-\co{inode} traversal uses
the closely related sequence counters to detect large-scale changes
to the filesystem tree that would otherwise confuse this
traversal~\cite{NeilBrown2015PathnameLookup,NeilBrown2015RCUwalk}.
Such confusion could result in a number of anomalies, including successful
lookup of paths that never actually existed.
This pattern can be checked by enlisting the aid of lock dependency
checkers to verify that the access is within the scope of a sequence
lock reader.
Checking that the value does not leak out of that sequence lock is
more difficult.
\subsection{Unidirectional Data Flow}
\label{sec:Unidirectional Data Flow}
If data flows only in one direction, then OOTA cycles cannot form.
The following sections give several examples of this general design
pattern.
\subsubsection{Independent Input Data}
\label{sec:Independent Input Data}
Input data consisting of independent objects may be read using relaxed
accesses because these objects are not affected by downstream computations.
Here input data is defined broadly, including:
\begin{enumerate}
\item Measurements of outside environmental conditions.
\item Device configuration data.
\item Software configuration data.
\item Security policies.
\item Network routing information.
\end{enumerate}
The key point is that the concurrent-computation portion of the application
references but does not modify this data, and that there are no
object-to-object consistency constraints.
\subsubsection{Independent Output Data}
\label{sec:Independent Output Data}
Similarly, output data consisting of independent objects may be written
using relaxed accesses because these objects do not affect upstream
computations.
As before, output data is defined broadly, including:
\begin{enumerate}
\item Control of objects external to the computer.
\item Many classes of debug output.
\item Some use cases involving video frame buffers.
\item Some use cases involving communication to a later stage of a
software pipeline.
\end{enumerate}
Similar to the independent input data discussed in the preceding section,
the key point is that the concurrent-computation portion of the application
modifies but does not reference this data, and that there are no
object-to-object consistency constraints.
\begin{listing}[tbp]
\begin{verbatim}
1 atomic<int> cs1(0);
2 atomic<int> cs2(0);
3
4 void thread1()
5 {
6 int s1 = get_ext_state(1);
7 int s2 = get_ext_state(2);
8 cs1.store(reduce_state(s1), memory_order_relaxed);
9 cs2.store(reduce_state(s2), memory_order_relaxed);
10 }
11
12 void thread2()
13 {
14 int c = compute_ctl(cs1.load(memory_order_relaxed),
15 cs2.load(memory_order_relaxed));
16 set_ext_ctl(c);
17 }
\end{verbatim}
\caption{Unidirectional I/O Data Flow}
\label{lst:Unidirectional I/O Data Flow}
\end{listing}
Listing~\ref{lst:Unidirectional I/O Data Flow}
combines item~2 from the lists in
Sections~\ref{sec:Independent Output Data}
and~\ref{sec:Independent Input Data}.
The \co{thread1()} input data flows from the \co{get_ext_state()} functions
through the \co{reduce_state()} functions into the \co{cs1} and \co{cs2}
shared variables.
The \co{thread2()} output data flows from these same \co{cs1} and \co{cs2}
shared variables through \co{compute_ctl()} and finally is output
by \co{set_ext_ctl()}.
The data flow is unidirectional from input to output, so no OOTA cycles
can form.
\subsubsection{Statistical Counters}
\label{sec:Statistical Counters}
The canonical instance of a unidirectional data-flow pattern is the
statistical counter, in which each thread (or CPU, as the case may be)
updates its own counter, and the aggregate value of the counter is read
out by summing all threads'
counters~\cite[Section 5.2]{McKenney2018ParallelProgramming-2018-12-08a}.
Statistical counters do have concurrent updates and reads, and thus must
use atomics.
However, the concurrent reads can be modeled as returning approximate
results (for example, for monitoring or debugging), and can in fact be
modeled as sequentially consistent approximate operations.
But more to the point, data flow in real use cases is always
unidirectional, proceeding from the updater responding to an event
and flowing through the counter to some reader displaying or logging
statistics.
This unidirectional data flow precludes the cycles required for OOTA or
RFUB behavior to manifest.
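For concreteness, a minimal sketch of such a counter (the
\co{MAX_THREADS} constant and \co{my_tid()} helper are hypothetical)
might look as follows:
\begin{listing}[tbp]
\begin{verbatim}
 1 template<typename T>
 2 class StatCounter {
 3   std::array<atomic<T>, MAX_THREADS> ctr{};
 4 public:
 5   void increase(T n) {  // Invoked only by thread my_tid().
 6     ctr[my_tid()].fetch_add(n, memory_order_relaxed);
 7   }
 8   T readout() {         // Approximate concurrent sum.
 9     T sum = 0;
10     for (auto &c : ctr)
11       sum += c.load(memory_order_relaxed);
12     return sum;
13   }
14 };
\end{verbatim}
\caption{Statistical Counter (Sketch)}
\label{lst:Statistical Counter Sketch}
\end{listing}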
\begin{listing}[tbp]
\begin{verbatim}
1 StatCounter<unsigned long> a;
2 StatCounter<unsigned long> b;
3
4 void thread1()
5 {
6 unsigned long r1 = a.readout();
7 b.increase(r1);
8 }
9
10 void thread2()
11 {
12 unsigned long r2 = b.readout();
13 a.increase(r2);
14 }
\end{verbatim}
\caption{Statistical-Counter Abuse and OOTA}
\label{lst:Statistical-Counter Abuse and OOTA}
\end{listing}
An example abuse is shown in
Listing~\ref{lst:Statistical-Counter Abuse and OOTA}.
Lines~1 and~2 define a pair of statistical counters \co{a} and \co{b}.
The \co{thread1()} and \co{thread2()} functions form a classic
data-dependent OOTA cycle.
Assuming both statistical counters start out with all counters zero,
we could in theory see the following OOTA sequence of events:
\begin{enumerate}
\item Line~6 sums \co{a}'s counters, obtaining the sum 42.
\item Line~7 increases the current component of \co{b}'s counter by 42.
\item Line~12 sums \co{b}'s counters, obtaining the sum 42 due to the
increase from line~7.
\item Line~13 increases the current component of \co{a}'s counter by 42,
thus justifying the sum of 42 obtained by line~6.
\end{enumerate}
Of course, the code in
Listing~\ref{lst:Statistical-Counter Abuse and OOTA}
is complete nonsense: Counters should count events, not each other's
cumulative values.
The code as written is about as useful as the proverbial screen door in
a submarine.
Problems of this sort should be located in a code review, or better yet
during the preceding design review.\footnote{
Yes, this could be considered analogous to a difference-equation
control system.
But in that case, the system being controlled is part of the
loop, and proper synchronization must be used when communicating
with that system.
In addition, the actual difference-equation computation will
normally be single-threaded.
More importantly, if the system being controlled might pose a threat
to life and limb, the design review had jolly well better be
sufficiently well-informed and thorough as to avoid this sort
of problem.}
Exact values are sometimes obtained from statistical counters in
stop-the-world situations, such as checking for consistent results
at the end of a stress test or benchmarking
run~\cite[Sections 5.3 and 5.4]{McKenney2018ParallelProgramming-2018-12-08a}.
Alternatively, counter updates might be carried out while read-holding
a given reader-writer lock and counter reads while write-holding
that same lock.
In all of these cases, OOTA and RFUB behaviors are additionally avoided due
to the fully synchronized nature of the readout.
\subsubsection{Software Pipelines}
\label{sec:Software Pipelines}
Software pipelines break computation up into stages that might proceed
concurrently.
If the interface between a consecutive pair of stages is simple enough,
relaxed accesses might be used for the corresponding communication of data.
Pipelines are not necessarily strictly linear; in fact, it can be quite
advantageous to have concurrent stages feeding into a single subsequent
stage via a reduction step.
If the output of the concurrent stages is sufficiently simple, the
reduction step might be a simple relaxed atomic fetch-and-op operation
to a single scalar object.
An example of a sufficiently simple output is event counts emanating from
concurrent stream processing feeding into later sequential logic.
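A minimal sketch of such a reduction step (the \co{Item} type and the
\co{process()} and \co{log()} functions are hypothetical):
\begin{listing}[tbp]
\begin{verbatim}
 1 atomic<unsigned long> events(0);  // Reduction target.
 2
 3 void stream_stage(Item *item)  // One of several concurrent
 4 {                              // pipeline stages.
 5   events.fetch_add(process(item), memory_order_relaxed);
 6 }
 7
 8 void report_stage()  // Sequential stage, running after the
 9 {                    // concurrent stages have been joined.
10   log(events.load(memory_order_relaxed));
11 }
\end{verbatim}
\caption{Pipeline Reduction Step (Sketch)}
\label{lst:Pipeline Reduction Step Sketch}
\end{listing}
The thread join orders the concurrent increments before the sequential
readout, so all accesses to \co{events} may be relaxed.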
Note that the independent input and output data patterns discussed in
Sections~\ref{sec:Independent Input Data} and~\ref{sec:Independent Output Data}
might be endpoints of a software pipeline.
\subsubsection{Owner Field for Re-Entrant Mutex}
\label{sec:Owner Field for Re-Entrant Mutex}
This pattern is first analyzed for full C++, and then for strict C++.
Spoiler warning: There is reason to believe that this pattern works
in full C++, but not in strict C++.
\begin{listing}[tbp]
\begin{verbatim}
1 class my_reentrant_mutex {
2 std::mutex m;
3 // Writes of owner are protected by m.
4 // Only owner writes or clears its id.
5 std::atomic<std::thread::id> owner; // id() if not held
6 // Protected by m.
7 int count; // Lock reacquired count times by owner
8 . . .
9 }
10
11 void my_reentrant_mutex::lock() {
12 std::thread::id me = std::this_thread::get_id();
13 // No other thread can change whether owner == me.
14 if (owner.load(memory_order_relaxed) == me) {
15 ++count; // Done; reacquired the lock.
16 } else {
17 ... // Acquire m, leaving count == 0
18 owner.store(me, memory_order_relaxed);
19 }
20 }
\end{verbatim}
\caption{Re-Entrant Mutex Owner Field}
\label{lst:Re-Entrant Mutex Owner Field}
\end{listing}
Pseudo-code for a re-entrant exclusive mutex is shown in
Listing~\ref{lst:Re-Entrant Mutex Owner Field}.
Each mutex must track its owner (\co{owner} on line~5) in order to avoid
self-deadlock when the owner re-acquires a mutex that it already holds.
This owner field is updated only while the mutex is held (line~18), and
its value is used only to compare for equality to the current thread's ID.
Before releasing the mutex, the owner writes a special ID to the owner
field that is guaranteed not to match the ID of any thread.
Other threads can access the owner concurrently with the owner's
update, so the owner field must be atomic in order to avoid data races.
Of course, a nesting counter (\co{count} on line~7) must also be used
in order to identify the outermost lock-release operation; however, this
counter is accessed only by the thread currently holding the lock
(line~15).
Therefore, if the lock works correctly, exclusive access will be provided
to the nesting counter, as is required.
Those wishing to produce a proof of correctness are encouraged to try
induction.
Informally, the only time that the owner field can be equal to the thread ID
is when that thread carried out the last update to the owner field and
still holds the mutex:
\begin{enumerate}
\item Each thread writes only its ID or the special ID.
\item Because \co{memory_order_relaxed} loads are single-variable
SC, and because each thread sets the owner field to the special
ID before releasing the mutex, a given thread cannot see its own
ID unless it still holds the mutex.
\item Because atomic accesses forbid load tearing, each load from
the owner field will return either the special ID or the thread
ID corresponding to some thread that recently held the mutex.
\item Therefore, when a thread is not holding the mutex, it is guaranteed
not to load its own ID from the owner field.
\end{enumerate}
No other thread is allowed to write to the owner field while the mutex
is held, so it is impossible to form the cycles required for OOTA or
RFUB behavior to manifest.
Therefore, both the reads from and the writes to the owner field
may use \co{memory_order_relaxed}.
This is a special case of unidirectional data flow, with the data flowing
from the mutex holder to threads not holding the mutex.
The mutual exclusion provided by the mutex prevents any OOTA or RFUB
cycles from forming.
% More generally, if an object is written only while a given mutex is
% held, all accesses to that object may be relaxed without the possibility
% of OOTA or RFUB behaviors.
% --- If all writes to all objects are protected by a given mutex, maybe.
However, things might well be more difficult in strict C++.
These potential difficulties stem from the possibility of undefined
behavior (UB) back-propagating through a cycle so as to justify the OOTA
behavior~\cite{DavidGoldblatt2019NoElegantOOTAfix}.
To see the rationale for this back-propagating self-justifying UB (BPSJUB?),
consider a pair of threads each concurrently attempting to acquire a
mutex, but where (incorrectly and inconsistently) each sees that the
owner field matches their respective thread IDs.
Both threads would then simultaneously execute within their respective
critical sections, which could result in UB.
UB can back-propagate in time, which could quite possibly result
in the threads each seeing their own values in the owner field, which
is what instigated the UB in the first place.\footnote{
Full disclosure: Paul wrote this paragraph, and he is not
completely sold on back-propagating self-justifying UB. The
critical reader should therefore review both David Goldblatt's
working paper~\cite{DavidGoldblatt2019NoElegantOOTAfix} and Hans
Boehm's recent proposal~\cite{HansBoehm2020ConcurrentUB}.}
% @@@ Hans: How can memory_order_load_store help? If all of the
% accesses in the critical section are non-atomic, then
% there is nothing for that memory_order_load_store access
% to order with! Yes, the the compiler is required to
% avoid introducing data races involving those non-atomics,
% but all that the compiler is required to do in that case
% is to avoid hoisting those accesses above the owner-field
% load. What am I missing here?
Therefore, developers and reviewers should assume that owner fields for
re-entrant mutexes require full C++ in order to operate correctly.
\subsubsection{One-Way Memory Allocation}
\label{sec:One-Way Memory Allocation}
One-way memory allocation provides fresh memory that is never deallocated,
or that is deallocated using a heavyweight one-sided mechanism, for
example, a stop-the-world deallocation phase.
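A minimal sketch of such an allocator (the \co{ARENA_SIZE} constant is
hypothetical, and alignment is ignored for brevity):
\begin{listing}[tbp]
\begin{verbatim}
 1 char arena[ARENA_SIZE];  // Backing store, never freed.
 2 atomic<size_t> next(0);  // Allocation offset.
 3
 4 void *one_way_alloc(size_t size)
 5 {
 6   size_t offset = next.fetch_add(size, memory_order_relaxed);
 7   if (offset + size > ARENA_SIZE)
 8     return nullptr;       // Arena exhausted.
 9   return &arena[offset];  // Fresh, thread-private memory.
10 }
\end{verbatim}
\caption{One-Way Memory Allocation (Sketch)}
\label{lst:One-Way Memory Allocation Sketch}
\end{listing}
The relaxed \co{fetch_add()} suffices because each allocation returns
disjoint, never-before-written bytes; publication of pointers to the
allocated memory of course requires separate synchronization.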