Compare always checking + checking upfront
This is to help emphasize the question "Why even bother with IFUNC"?
How much time does it really save?
robertdfrench committed Jul 19, 2024
1 parent 3ef2240 commit 63ad2ff
Showing 4 changed files with 156 additions and 25 deletions.
12 changes: 11 additions & 1 deletion Makefile
@@ -8,7 +8,7 @@ help: $(MAKEFILE_LIST) #: Display this Help menu
| column -t -s':' \
| sort

all: cpu_demo speed_demo tty_demo #: Run all IFUNC demos
	@echo ""
test: all
check: all
@@ -26,8 +26,18 @@ rigorous_speed_demo: clean speed_demo_fixed.stats.txt speed_demo_ifunc.stats.txt
	$(call banner, Final Results)
	@echo "TEST LOW HIGH AVG"
	@printf "fixed\t"; cat speed_demo_fixed.stats.txt
	@printf "pointer\t"; cat speed_demo_pointer.stats.txt
	@printf "ifunc\t"; cat speed_demo_ifunc.stats.txt
	@echo ""

super_rigorous_speed_demo: clean speed_demo_fixed.stats.txt speed_demo_ifunc.stats.txt speed_demo_pointer.stats.txt speed_demo_always.stats.txt speed_demo_upfront.stats.txt #: Really, how slow is it?
	$(call banner, Final Results)
	@echo "TEST LOW HIGH AVG"
	@printf "fixed\t"; cat speed_demo_fixed.stats.txt
	@printf "pointer\t"; cat speed_demo_pointer.stats.txt
	@printf "ifunc\t"; cat speed_demo_ifunc.stats.txt
	@printf "upfront\t"; cat speed_demo_upfront.stats.txt
	@printf "always\t"; cat speed_demo_always.stats.txt
	@echo ""

%.stats.txt: %.low.txt %.high.txt %.avg.txt
77 changes: 53 additions & 24 deletions README.md
@@ -237,13 +237,13 @@ see how much overhead *ifunc itself* causes. After all, any function worth
optimizing is probably called frequently, so the overhead of the function
invocation is worth acknowledging.

To figure this out, I designed an experiment that would call an *dynamically
To figure this out, I designed an experiment that would call a *dynamically
resolved* function over and over again in a tight loop. Take a look at
[`speed_demo_ifunc.c`](speed_demo_ifunc.c) and
[`speed_demo_pointer.c`](speed_demo_pointer.c). These programs both do the same
work (incrementing a static counter), but the incrementer functions are resolved
in different ways: the former leverages GNU IFUNC, and the latter relies on
plain old function pointers.
[`speed_demo_ifunc.c`](src/speed_demo_ifunc.c) and
[`speed_demo_pointer.c`](src/speed_demo_pointer.c). These programs both do the
same work (incrementing a static counter), but the incrementer functions are
resolved in different ways: the former leverages GNU IFUNC, and the latter
relies on plain old function pointers.

Here is the overall logic:

@@ -252,7 +252,7 @@ Here is the overall logic:
1. Call this incrementer function a few billion times to get an estimate of its
cost.
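
The overall logic can be sketched for the function-pointer variant roughly as follows. This is a minimal illustration, not the contents of `speed_demo_pointer.c`: the names `select_incrementer`, `cpu_has_avx2_feature`, and `count_to` are hypothetical, and the loop bound is a parameter so the sketch can be run quickly instead of counting to `INT_MAX`.

```c
#include <assert.h>
#include <stddef.h>

static int counter = 0;

// Two interchangeable implementations, as in the demo programs.
static void avx2_incrementer(void)   { counter += 1; }
static void normal_incrementer(void) { counter += 1; }

// Hypothetical helper: report AVX2 support, or 0 on non-x86 targets where
// the "avx2" feature name is not available to the builtin.
static int cpu_has_avx2_feature(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

// Every call in the hot loop goes through this pointer.
static void (*increment_counter)(void) = NULL;

// Step 1: pick an implementation once, before the hot loop starts.
static void select_incrementer(void) {
    increment_counter = cpu_has_avx2_feature() ? avx2_incrementer
                                               : normal_incrementer;
}

// Step 2: call the selected implementation until the counter reaches `limit`.
static int count_to(int limit) {
    assert(increment_counter != NULL); // select_incrementer() must run first
    counter = 0;
    while (counter < limit) {
        increment_counter();
    }
    return counter;
}
```

Calling `select_incrementer()` once and then `count_to(INT_MAX)` would mirror the shape of the experiment; timing that second call is what gives the per-strategy cost.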

As a control, there is also [`speed_demo_fixed.c`](speed_demo_fixed.c) which
As a control, there is also [`speed_demo_fixed.c`](src/speed_demo_fixed.c) which
does the same incrementer work but without any dynamically resolved functions.
This can be used to help estimate what part of the runtime is dedicated to
function invocation vs what part is just doing addition.
@@ -262,23 +262,52 @@ programs and produces some simple statistics about their performance. These
numbers will of course change based on your hardware, but the `fixed` test
should serve as a baseline for comparison.
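
To make the LOW, HIGH, and AVG columns concrete, here is a small sketch of how such statistics could be computed from a set of repeated timings. It assumes the columns are the minimum, maximum, and arithmetic mean of the measured runs; the Makefile rules that actually produce the `.stats.txt` files are not shown in this diff, so `run_stats` and `summarize` are hypothetical names rather than anything from the repository.

```c
#include <assert.h>
#include <stddef.h>

// Summary of one test's repeated timings.
struct run_stats {
    double low;   // fastest run
    double high;  // slowest run
    double avg;   // arithmetic mean of all runs
};

// Reduce `n` timings (n must be at least 1) to low/high/avg.
static struct run_stats summarize(const double *timings, size_t n) {
    assert(timings != NULL && n > 0);
    struct run_stats s = { timings[0], timings[0], 0.0 };
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (timings[i] < s.low)  s.low = timings[i];
        if (timings[i] > s.high) s.high = timings[i];
        total += timings[i];
    }
    s.avg = total / (double)n;
    return s;
}
```

With only a handful of runs per test, LOW and HIGH can differ noticeably, so it is worth reading all three columns rather than AVG alone.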

#### Results
| TEST | LOW | HIGH | AVG |
|---------|------|------|-------|
| fixed | 2.93 | 4.20 | 3.477 |
| ifunc | 9.50 | 10.56| 9.986 |
| pointer | 6.23 | 7.44 | 6.791 |

Granted, ifunc does a lot more than function pointers do, so this is not a fair
comparison. ifunc handles symbol resolution lazily, which makes more sense for
large libraries (like glibc) -- if a library had to resolve all its dynamic
symbols during the loading process, it could cause a measurable performance
penalty even for applications which only need a small portion of those symbols.

But for smaller libraries like xz-utils, there just aren't many symbols that
need to be resolved in this way. Handling any such resolution when the library
is loaded would surely go unnoticed (relative to the cost of loading a library
from disk in the first place).
| *Results* | LOW | HIGH | AVG |
|-----------|------|------|-------|
| fixed | 2.93 | 4.20 | 3.477 |
| ifunc | 9.50 | 10.56| 9.986 |
| pointer | 6.23 | 7.44 | 6.791 |

What we see here is that ifunc has a not-insignificant overhead compared to
using a plain-old function pointer. On average, on my hardware, 2 billion calls
through ifunc take about 1.5 times as long as 2 billion calls through a function
pointer. Once the `fixed` baseline is subtracted, the cost attributable to
dispatch alone is about twice as large for ifunc as for a function pointer.
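
One way to read the AVG column of the table above is to treat the `fixed` average as an approximation of the cost of the loop and the addition alone, and subtract it from the other two. That is an assumption about what the control measures, not something the experiment proves:

```latex
\begin{aligned}
\text{total ratio (ifunc / pointer)} &= \frac{9.986}{6.791} \approx 1.47 \\
\text{ifunc overhead over fixed}     &= 9.986 - 3.477 = 6.509 \\
\text{pointer overhead over fixed}   &= 6.791 - 3.477 = 3.314 \\
\text{overhead ratio}                &= \frac{6.509}{3.314} \approx 1.96
\end{aligned}
```

So the whole run is roughly 1.5 times slower with ifunc, while the dispatch overhead in isolation is roughly doubled.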

Does this matter in real life? Absolutely not. Functions that are worth
optimizing are much more expensive than the "increment by one" functions that we
are analyzing here. It is interesting because GNU IFUNC claims to be a boon for
performance.

### Performance of Other Techniques
There are other techniques which are slower than ifunc. Take a look at the
`super_rigorous_speed_demo`, which brings two other experiments into play:
[`speed_demo_upfront.c`](src/speed_demo_upfront.c) and
[`speed_demo_always.c`](src/speed_demo_always.c).

`speed_demo_upfront.c` behaves similarly to `speed_demo_pointer.c`, except that
it stores the results of the CPU feature checks in global variables rather than
keeping track of a function pointer. A detection step still has to run first to
fill in these variables, and every call then branches on them to determine which
implementation gets used. This technique turns out to be slower than ifunc in
the best case (LOW), although its average is comparable. It is also safer than
storing function pointers: whereas function pointers can be set to arbitrary
values, boolean flags cannot. So an attacker able to modify these variables can
make the program *slower*, but cannot make the program behave *differently*.

`speed_demo_always.c` is designed to be the slowest technique -- it checks all
the necessary CPU features every time an implementation is needed and picks one
on the fly. Curiously, this technique is not significantly slower than anything
else. It is only marginally slower than ifunc in the case where we have just a
single CPU feature to check.

| TEST | LOW | HIGH | AVG |
|---------|------|-------|---------|
| fixed | 5.02 | 5.70 | 5.37 |
| pointer | 6.40 | 7.02 | 6.66 |
| ifunc | 8.56 | 11.11 | 9.64 |
| upfront | 9.24 | 9.41 | 9.33333 |
| always | 10.07| 10.56 | 10.2333 |


## Recap
![Yes, all shared libraries](img/brain.png)
43 changes: 43 additions & 0 deletions src/speed_demo_always.c
@@ -0,0 +1,43 @@
// This program is part of an experiment to compare the performance of GNU IFUNC
// vs plain-old function pointers. Run `make super_rigorous_speed_demo` to see a
// full comparison of speeds.
//
// This particular program does not cache anything: it re-checks the CPU
// features on every call and picks an incrementer on the fly. The choice of
// incrementer is irrelevant, since they are the same; our concern is the cost
// of invoking the chosen incrementer based on what strategy we use to select it.
#include <limits.h>
#include <stddef.h>
static int counter = 0;

// Use this incrementer algorithm if AVX2 is available.
void avx2_incrementer(void) {
    counter += 1;
}

// Use this if AVX2 is not available. It's the same as above, because we don't
// actually rely on AVX2.
void normal_incrementer(void) {
    counter += 1;
}

// This is the dispatcher. Every time it is called, it queries the CPU feature
// set again and then calls the matching incrementer directly. Nothing is
// cached between calls, which is why this is expected to be the slowest
// strategy.
void increment_counter(void) {
    if (__builtin_cpu_supports("avx2")) {
        avx2_incrementer();
    } else {
        normal_incrementer();
    }
}

int main(void) {
    // Count to ~ 2 Billion by calling a dynamically-resolved incrementer
    while (counter < INT_MAX) {
        increment_counter();
    }
    return 0;
}
49 changes: 49 additions & 0 deletions src/speed_demo_upfront.c
@@ -0,0 +1,49 @@
// This program is part of an experiment to compare the performance of GNU IFUNC
// vs plain-old function pointers. Run `make super_rigorous_speed_demo` to see a
// full comparison of speeds.
//
// This particular program checks the CPU features once, at the beginning of
// main, before any other work is done. The result is stored in a static boolean
// flag (not a function pointer), and every call branches on that flag. The
// choice of incrementer is irrelevant, since they are the same; our concern is
// the cost of invoking the chosen incrementer based on what strategy we use to
// select it.
#include <limits.h>
#include <stddef.h>
#include <stdbool.h>
static int counter = 0;
static bool cpu_has_avx2 = false;

// Use this incrementer algorithm if AVX2 is available.
void avx2_incrementer(void) {
    counter += 1;
}

// Use this if AVX2 is not available. It's the same as above, because we don't
// actually rely on AVX2.
void normal_incrementer(void) {
    counter += 1;
}

// Dispatch to an "appropriate" incrementer based on the cached CPU feature
// flag. The actual choice doesn't matter in this case; we just need a branch
// that depends on the detection result.
void increment_counter(void) {
    if (cpu_has_avx2) {
        avx2_incrementer();
    } else {
        normal_incrementer();
    }
}

// Query the CPU once and cache the answer for increment_counter to use.
void detect_cpu_features(void) {
    cpu_has_avx2 = __builtin_cpu_supports("avx2");
}

int main(void) {
    detect_cpu_features();

    // Count to ~ 2 Billion by calling a dynamically-resolved incrementer
    while (counter < INT_MAX) {
        increment_counter();
    }
    return 0;
}
