Compare always checking + checking upfront

This is to help emphasize the question "Why even bother with IFUNC"? How much time does it really save?
robertdfrench · Jul 19, 2024 · 63ad2ff
1 parent 3ef2240
Showing 4 changed files with 156 additions and 25 deletions.
diff --git a/Makefile b/Makefile
@@ -8,7 +8,7 @@ help: $(MAKEFILE_LIST) #: Display this Help menu
 		| column -t -s':' \
 		| sort
 
-all: cpu_demo speed_demo tty_demo #: Run all IFUNC demos
+all: cpu_demo speed_demo tty_demo #: Run all IFUNC demos
 	@echo ""
 test: all
 check: all
@@ -26,8 +26,18 @@ rigorous_speed_demo: clean speed_demo_fixed.stats.txt speed_demo_ifunc.stats.txt
 	$(call banner, Final Results)
 	@echo "TEST	LOW	HIGH	AVG"
 	@printf "fixed\t"; cat speed_demo_fixed.stats.txt
+	@printf "pointer\t"; cat speed_demo_pointer.stats.txt
 	@printf "ifunc\t"; cat speed_demo_ifunc.stats.txt
+	@echo ""
+
+super_rigorous_speed_demo: clean speed_demo_fixed.stats.txt speed_demo_ifunc.stats.txt speed_demo_pointer.stats.txt speed_demo_always.stats.txt speed_demo_upfront.stats.txt #: Really, how slow is it?
+	$(call banner, Final Results)
+	@echo "TEST	LOW	HIGH	AVG"
+	@printf "fixed\t"; cat speed_demo_fixed.stats.txt
 	@printf "pointer\t"; cat speed_demo_pointer.stats.txt
+	@printf "ifunc\t"; cat speed_demo_ifunc.stats.txt
+	@printf "upfront\t"; cat speed_demo_upfront.stats.txt
+	@printf "always\t"; cat speed_demo_always.stats.txt
 	@echo ""
 
 %.stats.txt: %.low.txt %.high.txt %.avg.txt

diff --git a/README.md b/README.md
@@ -237,13 +237,13 @@ see how much overhead *ifunc itself* causes. After all, any function worth
 optimizing is probably called frequently, so the overhead of the function
 invocation is worth acknowledging.
 
-To figure this out, I designed an experiment that would call an *dynamically
+To figure this out, I designed an experiment that would call a *dynamically
 resolved* function over and over again in a tight loop.  Take a look at
-[`speed_demo_ifunc.c`](speed_demo_ifunc.c) and
-[`speed_demo_pointer.c`](speed_demo_pointer.c).  These programs both do the same
-work (incrementing a static counter), but the incrementer functions are resolved
-in different ways: the former leverages GNU IFUNC, and the latter relies on
-plain old function pointers.
+[`speed_demo_ifunc.c`](src/speed_demo_ifunc.c) and
+[`speed_demo_pointer.c`](src/speed_demo_pointer.c).  These programs both do the
+same work (incrementing a static counter), but the incrementer functions are
+resolved in different ways: the former leverages GNU IFUNC, and the latter
+relies on plain old function pointers.
 
 Here is the overall logic:
 
@@ -252,7 +252,7 @@ Here is the overall logic:
 1. Call this incrementer function a few billion times to get an estimate of its
    cost.
 
-As a control, there is also [`speed_demo_fixed.c`](speed_demo_fixed.c) which
+As a control, there is also [`speed_demo_fixed.c`](src/speed_demo_fixed.c) which
 does the same incrementer work but without any dynamically resolved functions.
 This can be used to help estimate what part of the runtime is dedicated to
 function invocation vs what part is just doing addition.
@@ -262,23 +262,52 @@ programs and produces some simple statistics about their performance. These
 numbers will of course change based on your hardware, but the `fixed` test
 should serve as a baseline for comparison.
 
-#### Results
-| TEST    | LOW  | HIGH | AVG   |
-|---------|------|------|-------|
-| fixed   | 2.93 | 4.20 | 3.477 |
-| ifunc   | 9.50 | 10.56| 9.986 |
-| pointer | 6.23 | 7.44 | 6.791 |
-
-Granted, ifunc does a lot more than function pointers do, so this is not a fair
-comparison. ifunc handles symbol resolution lazily, which makes more sense for
-large libraries (like glibc) -- if a library had to resolve all its dynamic
-symbols during the loading process, it could cause a measurable performance
-penalty even for applications which only need a small portion of those symbols.
-
-But for smaller libraries like xz-utils, there just aren't many symbols that
-need to be resolved in this way. Handling any such resolution when the library
-is loaded would surely go unnoticed (relative to the cost of loading a library
-from disk in the first place).
+| *Results* | LOW  | HIGH | AVG   |
+|-----------|------|------|-------|
+| fixed     | 2.93 | 4.20 | 3.477 |
+| ifunc     | 9.50 | 10.56| 9.986 |
+| pointer   | 6.23 | 7.44 | 6.791 |
+
+What we see here is that ifunc has a not-insignificant overhead compared to
+using a plain-old function pointer. On average, on my hardware, 2 billion ifunc
+calls take about 1.5 times as long as 2 billion function pointer calls, or about
+twice as long once the `fixed` baseline is subtracted (6.51 vs 3.31).
+
+Does this matter in real life? Absolutely not. Functions that are worth
+optimizing are much more expensive than the "increment by one" functions that we
+are analyzing here. It is only interesting because GNU IFUNC is promoted as a
+boon for performance.
+
+### Performance of Other Techniques
+There are other techniques worth comparing against ifunc. Take a look at the
+`super_rigorous_speed_demo`, which brings two other experiments into play:
+[`speed_demo_upfront.c`](src/speed_demo_upfront.c) and
+[`speed_demo_always.c`](src/speed_demo_always.c).
+
+`speed_demo_upfront.c` behaves similarly to `speed_demo_pointer.c`, except that
+it stores the results of the CPU feature checks in global variables rather than
+keeping track of a function pointer. Every call still goes through a dispatch
+function that picks an implementation based on the value of these global
+variables. In the table below this technique performs about the same as ifunc,
+and it is also safer than storing function pointers: whereas function pointers
+can be set to arbitrary values, boolean flags cannot. So an attacker able to
+modify these variables can make the program *slower*, but cannot make the
+program behave *differently*.
+
+`speed_demo_always.c` is designed to be the slowest technique -- it checks all
+the necessary CPU features every time an implementation is needed and picks one
+on the fly. Curiously, this technique is not significantly slower than anything
+else. It is only marginally slower than ifunc in the case where we have just a
+single CPU feature to check. 
+
+| TEST    | LOW   | HIGH  | AVG     |
+|---------|-------|-------|---------|
+| fixed   | 5.02  | 5.70  | 5.37    |
+| pointer | 6.40  | 7.02  | 6.66    |
+| ifunc   | 8.56  | 11.11 | 9.64    |
+| upfront | 9.24  | 9.41  | 9.33333 |
+| always  | 10.07 | 10.56 | 10.2333 |
+
 
 ## Recap
 ![Yes, all shared libraries](img/brain.png)
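To put the second README table in perspective, the averages can be converted into a rough per-call overhead. The sketch below assumes the table values are wall-clock seconds and that each run makes `INT_MAX` calls (the loop bound used in the `src/speed_demo_*.c` programs). The helper name `overhead_ns_per_call` is hypothetical and not part of the repository.

```c
#include <limits.h>

/* Hypothetical helper (not part of the repository): estimated cost that a
 * dispatch strategy adds to each call, in nanoseconds, after subtracting the
 * `fixed` baseline. Assumes both averages are measured in seconds. */
double overhead_ns_per_call(double avg_seconds, double fixed_seconds, long calls) {
	return (avg_seconds - fixed_seconds) / (double)calls * 1e9;
}

/* AVG column of the second README table (units assumed to be seconds). */
static const double FIXED_AVG   = 5.37;
static const double POINTER_AVG = 6.66;
static const double IFUNC_AVG   = 9.64;
static const double UPFRONT_AVG = 9.33333;
static const double ALWAYS_AVG  = 10.2333;
```

Under those assumptions, `overhead_ns_per_call(IFUNC_AVG, FIXED_AVG, INT_MAX)` works out to roughly 2 ns per call, compared with about 0.6 ns for `pointer`, 1.8 ns for `upfront`, and 2.3 ns for `always`. The differences are real but tiny next to the cost of any function worth optimizing.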

diff --git a/src/speed_demo_always.c b/src/speed_demo_always.c
@@ -0,0 +1,43 @@
+// This program is part of an experiment to compare the performance of GNU IFUNC
+// vs other dispatch strategies. Run `make super_rigorous_speed_demo` to see a
+// full comparison of speeds.
+//
+// This particular program selects an "appropriate" incrementer function on
+// every call by re-checking CPU features each time. The choice of incrementer
+// is irrelevant, since they are the same; our concern is the cost of invoking
+// the chosen incrementer based on what strategy we use to select it.
+#include <limits.h>
+#include <stddef.h>
+static int counter = 0;
+
+// Use this incrementer algorithm if AVX2 is available.
+void avx2_incrementer() {
+	counter += 1;
+}
+
+// Use this if AVX2 is not available. It's the same as above, because we don't
+// actually rely on AVX2. 
+void normal_incrementer() {
+	counter += 1;
+}
+
+// This is the dispatcher. Unlike an ifunc stub, nothing is resolved once and
+// cached in the Global Offset Table: every call re-checks the CPU features
+// and then calls the matching implementation. This is intended as the worst
+// case, showing what the other strategies save by making the selection only
+// once.
+void increment_counter() {
+	if (__builtin_cpu_supports("avx2")) {
+		avx2_incrementer();
+	} else {
+		normal_incrementer();
+	}
+}
+
+int main() {
+	// Count to ~ 2 Billion by calling a dynamically-resolved incrementer
+	while (counter < INT_MAX) {
+		increment_counter();
+	}
+	return 0;
+}
diff --git a/src/speed_demo_upfront.c b/src/speed_demo_upfront.c
@@ -0,0 +1,49 @@
+// This program is part of an experiment to compare the performance of GNU IFUNC
+// vs other dispatch strategies. Run `make super_rigorous_speed_demo` to see a
+// full comparison of speeds.
+//
+// This particular program checks CPU features once at the beginning of main,
+// before any other work is done. The result is stored in a static boolean flag,
+// which is consulted on every call to pick an incrementer.  The choice of
+// incrementer is irrelevant, since they are the same; our concern is the cost
+// of invoking the chosen incrementer based on what strategy we use to select it.
+#include <limits.h>
+#include <stddef.h>
+#include <stdbool.h>
+static int counter = 0;
+static bool cpu_has_avx2 = false;
+
+// Use this incrementer algorithm if AVX2 is available.
+void avx2_incrementer() {
+	counter += 1;
+}
+
+// Use this if AVX2 is not available. It's the same as above, because we don't
+// actually rely on AVX2. 
+void normal_incrementer() {
+	counter += 1;
+}
+
+// Dispatch to an "appropriate" incrementer based on the cached CPU feature
+// flag. The actual choice doesn't matter in this case; only the branch does.
+void increment_counter() {
+	if (cpu_has_avx2) {
+		avx2_incrementer();
+	} else {
+		normal_incrementer();
+	}
+}
+
+void detect_cpu_features() {
+	cpu_has_avx2 = __builtin_cpu_supports("avx2");
+}
+
+int main() {
+	detect_cpu_features();
+
+	// Count to ~ 2 Billion by calling a dynamically-resolved incrementer
+	while (counter < INT_MAX) {
+		increment_counter();
+	}
+	return 0;
+}
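
For contrast with the two programs above, the function-pointer strategy that the README attributes to `speed_demo_pointer.c` (not part of this diff) can be sketched roughly as follows. This is an assumption about its general shape, not the author's file: the names `select_incrementer` and `incrementer` are hypothetical, and the preprocessor guard is added only so the sketch still compiles where `__builtin_cpu_supports("avx2")` is unavailable.

```c
static int counter = 0;

/* Two stand-in implementations; they are identical on purpose, as in the
 * programs above. */
static void avx2_incrementer(void) { counter += 1; }
static void normal_incrementer(void) { counter += 1; }

/* Hypothetical name: holds the address of the chosen implementation. */
static void (*incrementer)(void) = normal_incrementer;

/* Hypothetical resolver: check CPU features once and cache the address. */
void select_incrementer(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	if (__builtin_cpu_supports("avx2")) {
		incrementer = avx2_incrementer;
		return;
	}
#endif
	incrementer = normal_incrementer;
}

/* Each call is one indirect call: no feature check and no branch. */
void increment_counter(void) {
	incrementer();
}
```

A caller would run `select_incrementer()` once at startup and then call `increment_counter()` in the loop. Compared with `speed_demo_upfront.c`, the selection is stored as an address rather than a boolean, which removes the per-call branch but leaves a writable function pointer in memory, the trade-off the README describes.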