-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathRCU-bindings.tex
1097 lines (969 loc) · 42.6 KB
/
RCU-bindings.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[letterpaper,10pt]{article}
\usepackage{epsfig,endnotes}
%\usepackage{usenix,epsfig}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{subfig}
\usepackage{fixltx2e}
\usepackage{url} % \url{} command with good linebreaks
\usepackage{amssymb,amsmath}
\usepackage{graphicx}
\usepackage{enumerate}
\usepackage{listings}
\usepackage{xspace}
\usepackage[table]{xcolor}
\usepackage{rotating}
\lstset{basicstyle=\ttfamily}
\usepackage[bookmarks=true,bookmarksnumbered=true,pdfborder={0 0 0}]{hyperref}
% Avoid widows and orphans
\widowpenalty=500
\clubpenalty=500
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\lstset{
literate={\_}{}{0\discretionary{\_}{}{\_}}%
}
\newcommand{\co}[1]{\lstinline[breaklines=yes,breakatwhitespace=yes]{#1}}
\title{D0461R2: Proposed RCU C++ API}
\author{
{\bf Doc. No.: } WG21/D0461R2 \\
{\bf Date: } 2017-09-26 \\
{\bf Reply to: } Paul E. McKenney, Maged Michael, Michael Wong,\\
Isabella Muerte, Arthur O'Dwyer, David Hollman, \\
Andrew Hunter, Geoffrey Romer, and Lance Roy \\
{\bf Email: } [email protected], [email protected], \\
} % end author
% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
%\thispagestyle{empty}
\pagestyle{myheadings}
\markright{WG21/D0461R2}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This document is based on WG21/P0279R1 combined with feedback at
the 2015 Kona, 2016 Jacksonville, 2016 Issaquah, 2017 Kona, and
2017 Toronto meetings, which most notably called
for a C++-style method of handling different RCU implementations or
domains within a single translation unit, and which also contains
useful background material and references.
Later feedback and evaluation of existing RCU implementations permitted
significant simplification.
These simplifications eliminated domains, and this elimination will
likely be revisited, certainly before any merge into the International
Standard.
Unlike WG21/P0279R1, which simply introduced RCU's C-language practice,
this document presents proposals for C++-style RCU APIs.
At present, it appears that these are not conflicting proposals, but
rather ways of handling different C++ use cases resulting from
inheritance, templates, and different levels of memory pressure.
This document also incorporates content from
WG21/P0232R0\cite{PaulEMcKennneyToolKitP0232R0}.
Note that this proposal is related to the hazard-pointer proposal in
that both proposals defer destructive actions such as reclamation until
all readers have completed.
See P0233R3, which updates ``P0233R2: Hazard Pointers:
Safe Resource Reclamation for Optimistic Concurrency''
at \url{http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0233r2.pdf}.
This proposal is also related to ``P0561R0 An RAII Interface for Deferred
Reclamation'' at
\url{http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0561r0.html}.
This RAII proposal has replaced the C++ wrapper APIs that appeared in
earlier revisions of this document.
There have been other proposals for C++ RCU APIs~\cite{MaxKhiszinsky2015C++RCU}.
Note also that a redefinition of the infamous \co{memory_order_consume}
is the subject of two separate papers:
\begin{enumerate}
\item P0190R3, which updates
``P0190R2: Proposal for New \co{memory_order_consume} Definition'',
\url{http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r2.pdf}.
\item P0462R1, which updates
``P0462R0: Marking \co{memory_order_consume} Dependency Chains'',
\url{http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0462r0.pdf}.
Note however that P0462R1 is expected to be obsoleted by
an alternative proposal by JF Bastien that has seen production
use.
\end{enumerate}
Draft wording for this proposal may be found in the new working paper
``P0566R2: Proposed Wording for Concurrent Data Structures: Hazard
Pointer and Read-Copy-Update (RCU)''.
A detailed change log appears starting on
page~\pageref{sec:Change Log}.
\section{Introduction}
\label{sec:Introduction}
This document proposes C++ APIs for read-copy update (RCU).
For more information on RCU, including RCU semantics, see
WG21/P0462R0 (``Marking \co{memory_order_consume} Dependency Chains''),
WG21/P0279R1 (``Read-Copy Update (RCU) for C++''),
WG21/P0190R2 (``Proposal for New \co{memory_order_consume} Definition''),
and
WG21/P0098R1 (``Towards Implementation and Use of \co{memory_order_consume}'').
Specifically, this document proposes
\co{rcu_reader} (Figure~\ref{fig:RAII RCU Readers}),
\co{rcu_obj_base} (Figure~\ref{fig:Retiring RCU-Protected Objects}), and
several free functions
(Figures~\ref{fig:Retiring RCU-Protected Objects}
and~\ref{fig:RCU Updaters}).
Section~\ref{sec:Existing C-Language RCU API} presents the base (C-style) RCU API,
Section~\ref{sec:RAII RCU Readers} presents a proposal for scoped RCU readers,
Section~\ref{sec:Retiring RCU-Protected Objects} presents proposals for handling of
RCU callbacks,
Section~\ref{sec:Hazard Pointers and RCU: Which to Use?} presents a
table comparing reference counting, hazard pointers, and RCU, and finally
Section~\ref{sec:Summary} presents a summary.
This is followed by an informational-only appendix that shows
some alternatives that were considered and rejected.
A fully functional implementation is available on github:
\url{https://github.com/paulmckrcu/RCUCPPbindings}.
See the Test/paulmck directory: Other directories contain other
alternatives that were considered and rejected, although the efforts
of their respective authors are deeply appreciated.
Their work was a critically important part of the learning process
leading to the solution presented in this document.
\section{Existing C-Language RCU API}
\label{sec:Existing C-Language RCU API}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 void rcu_read_lock();
2 void rcu_read_unlock();
3 void synchronize_rcu();
4 void call_rcu(struct rcu_head *rhp,
5 void (*cbf)(struct rcu_head *rhp));
6 void rcu_barrier();
7 void rcu_register_thread();
8 void rcu_unregister_thread();
9 void rcu_quiescent_state();
10 void rcu_thread_offline();
11 void rcu_thread_online();
12 void defer_rcu(void (*fct)(void *p), void *p);
\end{verbatim}
}
\caption{Existing C-Language RCU API}
\label{fig:Existing C-Language RCU API}
\end{figure}
Figure~\ref{fig:Existing C-Language RCU API}
shows the existing C-language RCU API as provided by implementations such as
userspace RCU~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCU}.
This API is provided for compatibility with existing C-language practice as
well as to provide the highest performance for fast-path code.
As we will see in
Section~\ref{sec:Existing C-Language RCU API and C++},
the C++ user will not need to be concerned with most
of these API members.
\subsection{Existing C-Language RCU API Detailed Description}
\label{sec:Existing C-Language RCU API Detailed Description}
Lines~1 and~2 show \co{rcu_read_lock()} and \co{rcu_read_unlock()},
which mark the beginning and the end, respectively, of an \emph{RCU read-side
critical section}.
These primitives may be nested, and matching \co{rcu_read_lock()}
and \co{rcu_read_unlock()} calls need not be in the same scope.
(That said, it is good practice to place them in the same scope
in cases where the entire critical section fits comfortably into
one scope.)
Line~3 shows \co{synchronize_rcu()}, which waits for any pre-existing
RCU read-side critical sections to complete.
The period of time that \co{synchronize_rcu()} is required to wait is
called a \emph{grace period}.
Note that a given call to \co{synchronize_rcu()} is \emph{not} required to
wait for critical sections that start later.
Lines~4 and~5 show \co{call_rcu()}, which, after a subsequent grace period
elapses, causes the \co{cbf(rhp)} \emph{RCU callback function} to be invoked.
Thus, \co{call_rcu()} is the asynchronous counterpart to
\co{synchronize_rcu()}.
In most cases, \co{synchronize_rcu()} is easier to use, however, \co{call_rcu()}
has the benefit of moving the grace-period delay off of the updater's
critical path.
Use of \co{call_rcu()} is thus critically important for good performance of
update-heavy workloads, as has been demonstrated by implementation experience
across a variety of environments~\cite{PaulEMcKenney2015ReadMostly}.
Note that although \co{call_rcu()}'s callbacks are guaranteed not to be
invoked too early, there is no guarantee that their execution won't be
deferred for a considerable time.
This can be a problem if a given program requires that all outstanding
RCU callbacks be invoked before that program terminates.
The \co{rcu_barrier()} function shown on line~6 is intended for this
situation.
This function blocks until all callbacks corresponding to previous
\co{call_rcu()} invocations have been invoked and also until after
those invocations have returned.
Therefore, taking the following steps just before terminating a program
will guarantee that all callbacks have completed:
\begin{enumerate}
\item Take whatever steps are required to ensure that there are no
further invocations of \co{call_rcu()}.
\item Invoke \co{rcu_barrier()}.
\end{enumerate}
Carrying out this procedure just prior to program termination can be very
helpful for avoiding false positives when using tools such as valgrind.
Many RCU implementations require that every thread announce itself to
RCU prior to entering the first RCU read-side critical section, and
to announce its departure after exiting the last RCU read-side
critical section.
These tasks are carried out via the \co{rcu_register_thread()} and
\co{rcu_unregister_thread()}, respectively.
The implementations of RCU that feature the most aggressive implementations of
\co{rcu_read_lock()} and \co{rcu_read_unlock()} require that each thread
periodically pass through a \emph{quiescent state}, which is announced to RCU
using \co{rcu_quiescent_state()}.
A thread in a quiescent state is guaranteed not to be in an RCU
read-side critical section.
Threads can also announce entry into and exit from \emph{extended
quiescent states}, for example, before and after blocking system
calls, using \co{rcu_thread_offline()} and \co{rcu_thread_online()}.
Finally, \co{defer_rcu()} can be thought of as a non-intrusive
variant of \co{call_rcu()} that maintains an array of references
to memory awaiting callback invocation, as opposed to the traditional
implementations of \co{call_rcu()}, which maintain references in a
linked list.
\subsection{Existing C-Language RCU API and C++}
\label{sec:Existing C-Language RCU API and C++}
Because of recent advances in both userspace RCU and the Linux kernel's
support for userspace RCU, C++ users will not need to be concerned
with most of the C-language API members shown in
Figure~\ref{fig:Existing C-Language RCU API}.
Both \co{rcu_read_lock()} and \co{rcu_read_unlock()} are encapsulated
in a C++ RAII class named \co{rcu_reader}.
The \co{synchronize_rcu()} function is exposed via the
\co{std::synchronize_rcu()} free function, and
the \co{rcu_barrier()} free function is exposed via the
\co{std::rcu_barrier()} free function.
The \co{call_rcu()} function is exposed via the
\co{std::rcu_obj_base<T,D>::retire()} member function,
and a non-intrusive variant of \co{call_rcu()} is exposed via the
\co{std::rcu_retire()} templated free function.
We expect that \co{rcu_register_thread()} and \co{rcu_unregister_thread()}
will be buried into the thread-creation and thread-exit portions of the
standard library.
Use of the \co{sys_membarrier()} RCU implementation from the userspace
RCU library allows \co{rcu_quiescent_state()}, \co{rcu_thread_offline()},
and \co{rcu_thread_online()} to be dispensed with.
Of course, implementation experience and use may result in changes
to the C++ API as well as to the userspace RCU library.
For example, the possibility of RCU domains is discussed in
Appendix~\ref{sec:RCU Domains},
and a few historical methods of handling retire are shown in
Appendix~\ref{sec:Historical RCU-Protected Retirement Plans}.
\section{RAII RCU Readers}
\label{sec:RAII RCU Readers}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 class std::rcu_reader {
2 public:
3 rcu_reader() noexcept;
4 rcu_reader(std::defer_lock_t) noexcept;
5 rcu_reader(const rcu_reader &) = delete;
6 rcu_reader(rcu_reader &&other) noexcept;
7 rcu_reader& operator=(const rcu_reader&) = delete;
8 rcu_reader& operator=(rcu_reader&& other) noexcept;
9 ~rcu_reader() noexcept;
10 void swap(rcu_reader& other) noexcept;
11 void lock() noexcept;
12 void unlock() noexcept;
13 };
\end{verbatim}
}
\caption{RAII RCU Readers}
\label{fig:RAII RCU Readers}
\end{figure}
The \co{rcu_reader} class shown in
Figure~\ref{fig:RAII RCU Readers}
may be used for RAII RCU read-side critical sections, and
satisfies the requirements of \co{BasicLockable}.
An argumentless constructor enters an RCU read-side critical section,
a constructor with an argument of type \co{defer_lock_t} defers
entering a critical section until a later call to a
\co{std::rcu_reader::lock()} member function,
and a constructor taking another \co{rcu_reader} instance
transfers the RCU read-side critical section from that instance
to the newly constructed instance.
The assignment operator may be used to transfer an RCU read-side critical
section from another \co{rcu_reader} instance to an already-constructed
instance.
This class is intended to be used in a manner similar to \co{std::lock_guard},
so the destructor exits the RCU read-side critical section.
This implementation is movable but not copyable, which enables
an RCU read-side critical section's scope to be arbitrarily
extended across function boundaries via \co{std::forward} and
\co{std::move}.
The \co{std::rcu_reader::swap()} member function swaps the roles
of a pair of \co{rcu_reader} instances in the expected manner.
Finally, the \co{std::rcu_reader::lock()} member function enters an
RCU read-side critical section and the \co{std::rcu_reader::unlock()}
member function exits its critical section.
Invoking \co{std::rcu_reader::lock()} on an instance already corresponding
to an RCU read-side critical section and invoking
\co{std::rcu_reader::unlock()} on an instance not corresponding to
an RCU read-side critical section results in undefined behavior.
\section{Retiring RCU-Protected Objects}
\label{sec:Retiring RCU-Protected Objects}
The traditional C-language RCU callback uses address arithmetic
to map from the \co{rcu_head} structure to the enclosing struct,
for example, via the \co{container_of()} macro.
Of course, this approach also works for C++, but this section describes
a more palatable approach that is quite similar to that proposed for
hazard pointers.
Other historical approaches may be found in
Sections~\ref{sec:Pointer To Enclosing Class}
and~\ref{sec:Address Arithmetic}.
\subsection{Retiring: Implementation Experience}
\label{sec:Retiring: Implementation Experience}
Known implementation experience either (1)~links newly retired
objects together or (2)~references them from a set of vectors.
Userspace RCU provides API members that do both: \co{call_rcu()}
links retired objects and \co{defer_rcu()} references them from a
vector.
The main advantage of the linking approach is that the retirement
process can easily and efficiently ensure that no allocations occur on
the retirement path, which is often part of the free path, thus avoiding
out-of-memory deadlocks.
Such deadlocks could otherwise occur if there was no memory available,
and that lack of memory prevented retirement, and in turn preventing freeing.
However, many applications avoid out-of-memory deadlocks by overprovisioning
memory.
For these applications, there is little or no reason to avoid allocating
memory during the process of retiring objects.
Such applications might choose to occasionally allocate vectors to
keep track of newly retired objects.
And this choice brings additional benefits:
\begin{enumerate}
\item The greater cache locality of vectors has been shown to
improve the efficiency of the retirement
process~\cite{MathieuDesnoyers2012URCU}.
\item Removing the need to associate retirement-time objects with
the RCU-protected object also removes any restrictions in
type and inheritance.
For but one example, the vector approach allows retiring an
RCU-protected \co{string} instance.
\item If retirement is infrequent, but there is a very large number
of RCU-protected objects in existence, the vector approach
can offer a smaller memory footprint.
\end{enumerate}
There are a number of ways to manage the vectors:
\begin{enumerate}
\item Fixed per-thread allocation, as is done for the \co{defer_rcu()}
interface in userspace RCU, with a \co{synchronize_rcu()}
invoked internally to \co{defer_rcu()} when the vector fills.
This performs quite well~\cite{MathieuDesnoyers2012URCU}, but
is not appropriate in environments where it is necessary to
retire objects from within RCU read-side critical sections.
\item Provide an initial set of per-thread vectors, allocating more
as these vectors fill, and enqueuing them for reuse after
their element's deleters have been invoked.
\item As above, but also using some mechanism to free vectors that
are not being used.
\item Allocate the retirement vectors as needed during memory allocation,
ensuring that there are enough slots to handle sudden
retirement of all instances of all dynamically allocated
objects.\footnote{
Kudos to Lance Roy for pointing out this possibility.}
The amount of retirement-path allocation can of course be reduced
by enqueuing them for reuse after their element's deleters have
been invoked.
However, please note that this approach assumes
that retirement leads to freeing, and that there are some
rare but important algorithms for which this is not the
case,\footnote{
Especially when using \co{synchronize_rcu()} instead
of \co{call_rcu()}~\cite{GauthamShenoy2006RCUrwlock,SrivatsaSBhat2014RCUrwlock,RanLiu2014PassiveRWLock,PedroRmalhete2015rwlockRCU,McKenney98}.}
with the Linux kernel's \co{rcu_sync} mechanism perhaps being
the most prominent.
The key point is that RCU's function is waiting for readers,
not necessarily reclaiming memory.
Furthermore, C++ allows special-purpose memory allocators to
be created easily, and it is not likely to be practical to
require all of them to interact with RCU.
\end{enumerate}
We expect further implementation experience to uncover additional strategies for
tracking newly retired objects.
In particular, we expect to see hybrid schemes that make use of per-object
space via the \co{retire()} member function and that use vectors to
track object passed to the \co{rcu_retire()} free function.
\subsection{Retiring: Proposed C++ APIs}
\label{sec:Retiring: Proposed C++ APIs}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 template<typename T, typename D = std::default_delete<T>>
2 class std::rcu_obj_base {
3 public:
4 void retire(D d = {}) noexcept;
5 };
6
7 template<typename T, typename D = std::default_delete<T>>
8 void std::rcu_retire(T *p, D d = {});
\end{verbatim}
}
\caption{Retiring RCU-Protected Objects}
\label{fig:Retiring RCU-Protected Objects}
\end{figure}
Bowing to both types of implementation experience, we propose an API that
lends itself to linking retired objects and an API that lends
itself to referencing retired objects via vectors.
Note that a vector-based retirement implementation can simply ignore
any per-object storage, and a linked-list retirement implementation
can simply allocate the required storage on each retire.
Higher-quality implementations can of course provide further optimizations.
The \co{rcu_obj_base} class provides a \co{retire()} method that
takes a deleter,
as shown in
Figure~\ref{fig:Retiring RCU-Protected Objects}.
This class contains any storage required by that deleter, so that
the implementation of
\co{std::rcu_obj_base<T,D>::retire()} never needs to allocate
memory on the retire/free path.
The deleter's \co{operator()} is invoked after a grace period.
The deleter type defaults to \co{std::default_delete<T>},
but one could also use a
custom functor class with an \co{operator()} that carries out teardown actions
before freeing the object, or a raw function pointer type such as
\co{void(*)(T*)}.
Given sufficient type erasure and reconstitution, the \co{call_rcu()}
C-language free function from the userspace RCU library can be used to
implement \co{retire()}.
We recommend avoiding deleter types such as \co{std::function<void(T*)>}
(and also any other type requiring memory allocation) because
allocating memory on the free path can result in out-of-memory deadlocks,
but we nevertheless recognize that C++ applications that assume ample
memory might use such deleters for convenience.
However, such users are better served by the \co{std::rcu_retire()}
templated free function shown on lines~7-8 of
Figure~\ref{fig:Retiring RCU-Protected Objects}.
Although this function can allocate on the retire/free path, high-quality
implementations will take steps to ensure that such allocation is very
rare, for example, by pre-allocating sufficient storage to avoid
such allocation in the common case.
Simple implementations can instead provide a trivial \co{std::rcu_retire()}
function that is a thin wrapper around \co{std::rcu_obj_base::retire()}.
Implementation experience has shown that high-quality implementations
of \co{std::rcu_retire()} can achieve better cache locality than can
high-quality implementations of \co{std::rcu_obj_base::retire()},
which results in better performance and
scalability.\footnote{
Note that this implementation experience is not limited to
Google~\cite{MathieuDesnoyers2012URCU}.}
Users wishing the best performance and scalability will therefore tend
to prefer \co{std::rcu_retire()} over \co{std::rcu_obj_base::retire()}.
Finally, \co{std::rcu_retire()} is non-intrusive.
This means that (for example) an object of type \co{std::string} can be
passed to \co{std::rcu_retire()}.
In contrast, \co{std::rcu_obj_base::retire()} can only be passed types
related to \co{std::rcu_obj_base}.
\section{RCU Updaters}
\label{sec:RCU Updaters}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 void synchronize_rcu() noexcept;
2 void rcu_barrier() noexcept;
\end{verbatim}
}
\caption{RCU Updaters}
\label{fig:RCU Updaters}
\end{figure}
RCU updaters can use the free functions shown in
Figure~\ref{fig:RCU Updaters}.
The \co{<rcu>} header provides a \co{synchronize_rcu()} free function,
which maps to the C-language \co{synchronize_rcu()} function, which
waits for all pre-existing RCU readers to complete.
This header also provides the \co{rcu_barrier()} free function,
which maps to the C-language \co{rcu_barrier()} function,
which waits for all pending \co{retire()} and \co{rcu_retire()}
deleters to be invoked.
Implementation experience thus far indicates that both of these functions must
scale well, but that it is OK for them to have significant latency.
In fact, the significant latency helps reduce per-request overhead
by allowing concurrent callers' requests to be satisfied by the same
underlying operation.
Use cases that might otherwise require \co{synchronize_rcu()} to
have lower latency should instead use one of the retire APIs, as
these retire APIs impose extremely low latencies on their callers.
The \co{rcu_barrier()} API is used when removing code or data structures
that are used by the deleters passed to \co{call_rcu()}.
In such cases, it is necessary to wait for any outstanding deleters to
complete before freeing any code or data that those deleters might
make accesses to.
\section{Hazard Pointers and RCU: Which to Use?}
\label{sec:Hazard Pointers and RCU: Which to Use?}
\begin{sidewaystable*}
\small
\centering
\begin{tabular}{p{1.2in}|p{1.2in}|p{1.2in}|p{1.2in}|p{1.2in}}
& Reference Counting
& \raggedright Reference Counting with DCAS
& RCU
& Hazard Pointers \\
\hline
\hline
Unreclaimed objects
& \cellcolor{green!30}{\bf Bounded}
& \cellcolor{green!30}{\bf Bounded}
& Unbounded
& \cellcolor{green!30}{\bf Bounded} \\
\hline
\raggedright
Contention among readers
& Can be very high
& Can be very high
& \cellcolor{green!30}{\bf No contention}
& \cellcolor{green!30}
{\bf No contention} \\
\hline
\raggedright
Traversal forward progress
& Either blocking or lock-free with limited reclamation
& \cellcolor{green!30} {\bf Lock free}
& \cellcolor{blue!20}
{\bf Bounded population oblivious wait-free}
& \cellcolor{green!30}{\bf Lock-free} \\
\hline
\raggedright
Reclamation forward progress $^*$
& Either blocking or lock-free with limited reclamation
& \cellcolor{green!30} {\bf Lock free}
& Blocking
& \cellcolor{blue!20}
{\bf Bounded wait-free} \\
\hline
Traversal speed
& Atomic read-modify-write updates
& Atomic read-modify-write updates
& \cellcolor{green!30}{\bf No or low overhead}
& Store-load fence \\
\hline
Reference acquisition
& \cellcolor{green!30}{\bf Unconditional}
& \cellcolor{green!30}{\bf Unconditional}
& \cellcolor{green!30}{\bf Unconditional}
& Conditional \\
\hline
\raggedright
Automatic reclamation
& \cellcolor{green!30}{\bf Yes}
& \cellcolor{green!30}{\bf Yes}
& No
& No \\
\hline
Purpose of domains
& N/A
& N/A
& Isolate long-latency readers
& Limit contention, reduce space
bounds, etc. \\
\end{tabular}
\caption{Comparison of Deferred-Reclamation Mechanisms}
\label{tab:Comparison of Deferred-Reclamation Mechanisms}
\flushleft
\noindent
*~~Does not include memory allocator, just the reclamation itself.
\end{sidewaystable*}
Table~\ref{tab:Comparison of Deferred-Reclamation Mechanisms}
provides a rough summary of the relative advantages of reference
counting, RCU, and hazard pointers.
Advantages are marked in bold with green background, or with a blue
background for strong advantages.
Although reference counting has normally had quite limited capabilities
and been quite tricky to apply for general linked data-structure
traversal, given a double-pointer-width compare-and-swap instruction,
it can work quite well, as shown in the ``Reference Counting with DCAS''
column.
As a rough rule of thumb, for best performance and scalability, you
should use RCU for read-intensive workloads and hazard pointers for
workloads that have significant update rates.
As another rough rule of thumb, a significant update rate has updates
as part of more than 10\% of its operations.
Reference counting with DCAS is well-suited for small systems and/or
low read-side contention, and particularly on systems that have limited
thread-local-storage capabilities.
Both RCU and reference counting with DCAS allow unconditional reference
acquisition.
Specialized workloads will have other considerations.
For example, small-memory multiprocessor systems might be best-served by
hazard pointers, while the read-mostly data structures in real-time
systems might be best-served by RCU.
The relationship between the Hazard Pointers proposal and this RCU
proposal is as follows:
\begin{enumerate}
\item The \co{hazptr_obj_base} class is analogous to \co{rcu_obj_base}.
\item There is no RCU counterpart to \co{hazptr_domain}, in part because
RCU does not explicitly track read-side references to specific
objects.
\item The private \co{hazptr_obj} class is analogous to the pre-existing
\co{rcu_head} struct used in many RCU implementations.
Because this class is an implementation detail, there is no
need to have compatible names.
\item There is no RCU class analogous to \co{hazptr_rec} because
RCU does not track (or need to track) references to individual
RCU-protected objects.
\item There is no hazard pointers counterpart to the \co{rcu_reader}
class.
This is because hazard pointers does not have (or need)
a counterpart to \co{rcu_read_lock()} and \co{rcu_read_unlock()}.
\item There is no hazard pointers counterpart to the \co{rcu_retire()}
templated free function because no hazard-pointers user has
expressed a need for it.
\end{enumerate}
\section{Summary}
\label{sec:Summary}
This paper demonstrates a way of creating C++ bindings for a C-language
RCU implementation, which has been tested against the userspace RCU
library.
Specifically, this document proposes
\co{rcu_reader} (Figure~\ref{fig:RAII RCU Readers}),
\co{rcu_obj_base} (Figure~\ref{fig:Retiring RCU-Protected Objects}), and
a few free functions (Figures~\ref{fig:Retiring RCU-Protected Objects}
and~\ref{fig:RCU Updaters}).
We believe that these bindings are also appropriate for the type-oblivious
C++ RCU implementations that information-hiding considerations are likely
to favor.
\section*{Acknowledgments}
We owe thanks to Pedro Ramalhete for his review and comments.
We are grateful to Jim Wasko for his support of this effort.
%\input{acknowledgments}
%\input{legal}
\appendix
\clearpage
This appendix contains historical proposals and directions.
As such, it is strictly informational.
\section{RCU Domains}
\label{sec:RCU Domains}
All the RCU implementations we are aware of started with a single
domain, and many still support only one domain.
Where domains were added, they were added for two reasons:
\begin{enumerate}
\item To allow alternative implementations with different design
tradeoffs, for example, the ``bottom half'' variant of
Linux-kernel RCU was added to allow Linux-kernel networking
to better defend against network-based denial-of-service
attacks.
\item In the case of Linux-kernel sleepable RCU (SRCU), to
isolate different SRCU users from each other, so that
one user having unusually long SRCU read-side critical
sections would not delay the grace periods of other users.
\end{enumerate}
It seems likely that the advent of a low-latency \co{sys_membarrier()}
system call will permit almost all userspace RCU use cases to be
addressed with a single implementation, so the first reason seems
unlikely to apply to Linux systems in the short term.
Note that Windows has had similar functionality for some time, and
\co{sys_membarrier()} is quite simple, so can be easily added to
other systems as needed.
In the meantime, systems lacking \co{sys_membarrier()} can use one of
the several userspace-RCU algorithms not relying on it.
As to the second reason, SRCU did not appear until some years after
RCU was first accepted into the Linux kernel, and it was many years
before SRCU was used at all heavily.
Even today, there are more than an order of magnitude more uses
of RCU than of SRCU.
Furthermore, many of the recent SRCU uses were motivated by the
accidental fact that SRCU grace periods are shorter than
non-expedited RCU grace periods, and unlike expedited RCU
grace periods, SRCU grace periods do not degrade real-time
response.
In addition, initial uses of RCU in C++ appear to be targeted towards
performance rather than abstraction (as expected), and performance use
cases tend to focus on short RCU read-side critical sections and low
update rates, each of which make domains less useful.
Initial uses of RCU in C++ also seem to be focused solely on
memory reclamation, and this use case can tolerate longer grace
periods than can the more esoteric non-reclamation RCU use cases,
again making RCU domains less useful.
Finally, use of domains can reduce RCU's update-side efficiency.
To see this, note that production-quality RCU implementations
can serve many thousands of grace-period requests with a single
grace period~\cite{Sarma04c}, reducing the per-request grace-period
overhead to nearly zero.
Introducing domains increases the per-request overhead because each
domain needs its own grace-period computation.
Therefore, introduction of domains is an expensive step that should
not be taken without a strong and urgent need.
In time, it is quite possible that a strong and urgent need for C++
RCU domains will appear.
However, we should start with a very simple non-domain API for the
initial TS, and drive any proposals for the addition of domains from
actual experience with both uses and implementations.
Nevertheless, the following sections describe some historical proposals
for RCU C++ domains.
These were illustrated using the multiple userspace-RCU implementations,
but were also intended to address the potential need for isolation of
one user's readers from other users' grace periods.
\subsection{Compile-Time Domain Selection}
\label{sec:Compile-Time Domain Selection}
The quiescent-state based reclamation (QSBR) implementation is intended
for standalone applications where the developers have full control
over the entire application, and where extreme read-side performance
and scalability is required.
Applications use \co{#include "urcu-qsbr.hpp"} to select QSBR and
\co{-lurcu -lurcu-qsbr} to link to it.
These applications must use \co{rcu_register_thread()} and
\co{rcu_unregister_thread()} to announce the coming and going
of each thread that is to execute \co{rcu_read_lock()} and
\co{rcu_read_unlock()}.
They must also use \co{rcu_quiescent_state()}, \co{rcu_thread_offline()},
and \co{rcu_thread_online()} to announce quiescent states to RCU.
The memory-barrier implementation is intended for applications that
can announce threads (again using \co{rcu_register_thread()} and
\co{rcu_unregister_thread()}), but for which announcing quiescent states is
impractical.
Such applications use \co{#include "urcu-mb.hpp"} and
\co{-lurcu-mb} to select the memory-barrier implementation.
Such applications will incur the overhead of a full memory barrier in
each call to \co{rcu_read_lock()} and \co{rcu_read_unlock()}.
The signal-based implementation represents a midpoint between the QSBR
and memory-barrier implementations.
Like the memory-barrier implementation, applications must announce
threads, but need not announce quiescent states.
On the one hand, readers are almost as fast as in the QSBR implementation,
but on the other applications must give up a signal to RCU, by default
\co{SIGUSR1}.
Such applications use \co{#include "urcu-signal.hpp"} and
\co{-lurcu-signal} to select signal-based RCU.
So-called ``bullet-proof RCU'' avoids the need to announce either threads
or quiescent states, and is therefore the best choice for use by
libraries that might well be linked with RCU-oblivious applications.
The penalty is that \co{rcu_read_lock()} incurs both a memory barrier
and a test and \co{rcu_read_unlock()} incurs a memory barrier.
Such applications or libraries use \co{#include urcu-bp.hpp} and
\co{-lurcu-bp}.
\subsection{Run-Time Domain Selection}
\label{sec:Run-Time Domain Selection}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 class rcu_domain {
2 public:
3 constexpr explicit rcu_domain() noexcept { };
4 rcu_domain(const rcu_domain&) = delete;
5 rcu_domain(rcu_domain&&) = delete;
6 rcu_domain& operator=(const rcu_domain&) = delete;
7 rcu_domain& operator=(rcu_domain&&) = delete;
8 virtual void register_thread() = 0;
9 virtual void unregister_thread() = 0;
10 static constexpr bool register_thread_needed() { return true; }
11 virtual void quiescent_state() noexcept = 0;
12 virtual void thread_offline() noexcept = 0;
13 virtual void thread_online() noexcept = 0;
14 static constexpr bool quiescent_state_needed() { return false; }
15 virtual void read_lock() noexcept = 0;
16 virtual void read_unlock() noexcept = 0;
17 virtual void synchronize() noexcept = 0;
18 virtual void retire(rcu_head *rhp, void (*cbf)(rcu_head *rhp)) = 0;
19 virtual void barrier() noexcept = 0;
20 };
\end{verbatim}
}
\caption{RCU Domain Base Class}
\label{fig:RCU Domain Base Class}
\end{figure}
Figure~\ref{fig:RCU Domain Base Class}
shows the abstract base class for runtime selection of RCU domains.
Each domain creates a concrete subclass that implements its RCU APIs:
\begin{itemize}
\item Bullet-proof RCU: \co{class rcu_bp}
\item Memory-barrier RCU: \co{class rcu_mb}
\item QSBR RCU: \co{class rcu_qsbr}
\item Signal-based RCU: \co{class rcu_signal}
\end{itemize}
Of course, additional implementations of RCU may be constructed by
deriving from \co{rcu_domain} and/or by implementing the API
shown in
Figure~\ref{fig:Existing C-Language RCU API}.
\section{Historical RCU-Protected Retirement Plans}
\label{sec:Historical RCU-Protected Retirement Plans}
This section lists alternative methods of retiring RCU-protected
objects.
These might prove helpful for some use cases, and can be implemented
in terms of the intrusive approach described in
Section~\ref{sec:Retiring RCU-Protected Objects}.
\subsection{Pointer To Enclosing Class (Informational Only)}
\label{sec:Pointer To Enclosing Class}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 template<typename T>
2 class rcu_head_ptr: public rcu_head {
3 public:
4 rcu_head_ptr()
5 {
6 this->container_ptr = nullptr;
7 }
8
9 rcu_head_ptr(T *containing_class)
10 {
11 this->container_ptr = containing_class;
12 }
13
14 static void trampoline(rcu_head *rhp)
15 {
16 T *obj;
17 rcu_head_ptr<T> *rhdp;
18
19 rhdp = static_cast<rcu_head_ptr<T> *>(rhp);
20 obj = rhdp->container_ptr;
21 if (rhdp->callback_func)
22 rhdp->callback_func(obj);
23 else
24 delete obj;
25 }
26
27 void retire(void callback_func(T *obj) = nullptr)
28 {
29 this->callback_func = callback_func;
30 call_rcu(static_cast<rcu_head *>(this), trampoline);
31 }
32
33 void retire(class rcu_domain &rd,
34 void callback_func(T *obj) = nullptr)
35 {
36 this->callback_func = callback_func;
37 rd.retire(static_cast<rcu_head *>(this), trampoline);
38 }
39
40 private:
41 void (*callback_func)(T *obj);
42 T *container_ptr;
43 };
\end{verbatim}
}
\caption{RCU Callbacks: Pointer (Informational Only)}
\label{fig:RCU Callbacks: Pointer}
\end{figure}
If complex inheritance networks make inheriting from an
\co{rcu_head} derived type impractical, one alternative is
to maintain a pointer to the enclosing class as shown in
Figure~\ref{fig:RCU Callbacks: Pointer}.
This \co{rcu_head_ptr} class is included as a member of the RCU-protected
class.
The \co{rcu_head_ptr} class's pointer must be initialized, for example,
in the RCU-protected class's constructor.
If the RCU-protected class is \co{foo} and the name of the
\co{rcu_head_ptr} member function is \co{rh}, then
\co{foo1.rh.retire(my_cb)} would cause the function \co{my_cb()} to be
invoked after the end of a subsequent grace period.
As with the previous classes, omitting the deleter results
in the object being passed to \co{delete} and an \co{rcu_domain}
object may be specified.
Please note that this section is informational only: This approach
is \emph{not} being proposed for standardization.
\subsection{Address Arithmetic (Informational Only)}
\label{sec:Address Arithmetic}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 template<typename T>
2 class rcu_head_container_of {
3 public:
4 static void set_field(const struct rcu_head T::*rh_field)
5 {
6 T t;
7 T *p = &t;
8
9 rh_offset = ((char *)&(p->*rh_field)) - (char *)p;
10 }
11
12 static T *enclosing_class(struct rcu_head *rhp)
13 {
14 return (T *)((char *)rhp - rh_offset);
15 }
16
17 private:
18 static inline size_t rh_offset;
19 };
20
21 template<typename T>
22 size_t rcu_head_container_of<T>::rh_offset;
\end{verbatim}
}
\caption{RCU Callbacks: Address Arithmetic (Informational Only)}
\label{fig:RCU Callbacks: Address Arithmetic}
\end{figure}
Figure~\ref{fig:RCU Callbacks: Address Arithmetic}
shows an approach that can be used if memory is at a premium and
the inheritance techniques cannot be used.
The \co{set_field()} method sets the offset of the
\co{rcu_head_container_of} member within the enclosing RCU-protected