Substantial Performance Regression of Dict operations in Python 3.12.0rc1 versus Python 3.11.4 #109049
Comments
Sorry, but I don't think this proves anything. There could be a LOT of reasons why a complicated function slows down with a different interpreter. If this is indeed a performance regression in the dictionary, you should be able to reproduce the result with ONLY dictionary operations. As it stands, the issue is more of a speculation. In fact, cProfile itself changed between 3.11 and 3.12, and it is a deterministic profiler, which can introduce significant overhead if you have many small function calls. Like I said, if you believe this is a dictionary performance regression, you should be able to isolate the dict operation, do it in a loop, and stop-watch it; that should show the regression. Otherwise, I don't think this is worth investigating from CPython's perspective.
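Something along these lines would qualify; this is only a sketch of the kind of isolated, stopwatch-timed micro-benchmark being asked for, with operations and sizes chosen for illustration rather than taken from the reporter's workload:

```python
# Hypothetical stopwatch benchmark isolating a few dict operations in a loop.
# Sizes and iteration counts are illustrative only.
import time

def bench_dict_ops(n=100_000):
    src = {i: i for i in range(1000)}
    d = {}
    start = time.perf_counter()
    for i in range(n):
        d.update(src)                 # dict.update
        snapshot = d.copy()           # dict.copy
        _ = snapshot.get(i % 1000)    # dict.get
        for _item in d.items():       # create and touch the items view
            break
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"dict ops loop: {bench_dict_ops():.2f} s")
```

Running the same script unchanged under each interpreter (python3.11 and python3.12) gives directly comparable numbers without cProfile's overhead.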
Five manual, consecutive runs with Python 3.11:
Five manual, consecutive runs with Python 3.12:
I would say the results are statistically significant, and the slowdown of Python 3.12 versus Python 3.11 is evident here. It seems plausible that the root cause is related to memory management rather than being specific to the dict built-in, as I had originally thought: this test case consumes 20GB+ of memory.
I should check the performance of the other dict methods I referenced in the original submission, as there may be multiple issues here. I'll wait for feedback on these tests first.
Yes, this is an observable slowdown. What's the result when the loop is smaller? Or when the memory consumption is smaller (same loop count, but without accumulating that much memory)? It's not a common pattern to accumulate very small pieces of memory up to a very large total (not saying that the slowdown doesn't mean anything); it could be a tradeoff that makes the common pattern faster. It could be a memory-management-related reason, but it could also be something else. I'm not sure who the expert in this area is, but it would be nice to have more information (the tests I mentioned above) so we may narrow down the cause.
Here's a quick response for the case where the loop is 100 times smaller for each datatype (again slowdown observed):
Here's the original test case but with immediate removal of the data entry inside the loop, either with pop() or del d[key]. In this case there is no growth in memory usage (as expected). It's unclear to me that this tells us much more; the previous tests would suggest we simply have a per-operation slowdown.
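Since the original test case isn't reproduced in this thread, the following is only a sketch of what the "insert, then immediately remove" variant might look like; keys and values are placeholders:

```python
# Sketch of the flat-memory variant: each entry is added and immediately
# removed, so the dict never grows. Values are placeholders.
import time

def bench_insert_remove(n=2_000_000, use_pop=True):
    d = {}
    start = time.perf_counter()
    for i in range(n):
        d[i] = (i, str(i))    # add one small entry
        if use_pop:
            d.pop(i)          # remove it with pop()
        else:
            del d[i]          # ...or with del d[key]
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"pop(): {bench_insert_remove(use_pop=True):.2f} s")
    print(f"del:   {bench_insert_remove(use_pop=False):.2f} s")
```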
The performance regression seems to take place between 3.12.0a7 and 3.12.0b1. I'm not sure it's a single commit; it may be an accumulation of a number of commits, possibly in the area of the sub-interpreter work.
I can confirm there is some kind of regression. Performing this benchmark:
results in
This is for comparison of 3.11.4 with current main.
From speed.python.org: The performance regression seems to occur around the merge of #19474, which is known to reduce performance a bit.
For the bigger picture: my application belongs to a class of scientific computing applications, numerically solving coupled non-linear PDEs and extensively using NumPy, SciPy, H5Py, and sparse-matrix linear algebra. The 3.11.4 and 3.12.0rc2 installations use identical Python module versions (NumPy, SciPy, H5Py/HDF5 built from source for each Python version). These are my performance test results, benchmarked against Python 3.8:
As expected, we see substantial performance improvements for Python 3.11 over Python 3.8. However, Python 3.12.0rc2 appears to give up much of those gains.
I did some more tests with publicly available packages (sympy, lmfit and lark) and compared python 3.10, 3.11 and 3.12. The results are similar to the OP: a significant performance improvement from 3.10 to 3.11, but a regression from 3.11 to 3.12 for two out of the three test cases. For sympy the performance loss is about 10%.
Full code for benchmarks
Source was downloaded from python.org and compiled with:
Benchmark script
Update: I benchmarked sympy for ea2c001 (the immortal instances PR) and 916de04 (the commit before).
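The actual benchmark script is collapsed above; a minimal sketch of a comparable sympy timing, with the expression and repeat counts chosen purely for illustration (and sympy's cache cleared so repeated runs do real work), could look like this:

```python
# Illustrative sympy timing only; not the commenter's benchmark script.
import timeit
import sympy
from sympy.core.cache import clear_cache

x, y, z = sympy.symbols("x y z")

def work():
    clear_cache()                      # keep sympy's cache from hiding the work
    return sympy.expand((x + y + z) ** 12)

best = min(timeit.repeat(work, number=10, repeat=5))
print(f"best of 5: {best:.3f} s for 10 expansions")
```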
I've found this issue after noticing that our CI tests occasionally time out on GitHub with Python 3.12 while they worked fine on 3.11. We did see performance gains over several Python releases, but 3.12 seems to go back to the performance of 3.9 (at least CI-wise). The slowdown can be seen on other projects as well. For example, the above-mentioned sympy has some parts of its test suite (they run it split into 4 parts) 20% slower on Python 3.12 compared to 3.11, and Django tests seem about 10% slower. I know that CI timing can vary a lot, but the regression seems to happen across projects, and this issue shows some synthetic benchmarks as well.
The user sahnehaeubchen linked this issue to PEP 683 in this discussion; so far I don't see that possible connection discussed here. I don't know what they are basing the claim on, but it seems plausible to me that an additional branch on every object refcount check could have wide-ranging impacts. The PEP claims:
I could imagine that the benchmarks performed might not account for branch misprediction due to BTB misses, generally larger code sizes that don't fit into the L1 i-cache and L0/L1 BTB, and other cache effects of larger programs that affect branch prediction cost.
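For reference, one way to see PEP 683's mechanism from Python code: on 3.12+ immortal objects carry a fixed sentinel reference count rather than a live count, and it is the per-incref/decref check for that state that the comments above are weighing. The snippet below just makes the sentinel visible; exactly which objects are immortalized depends on the CPython version and build.

```python
# Illustration of PEP 683's visible effect: immortal singletons report a large,
# fixed refcount on 3.12+ instead of a live count (exact values vary by build).
import sys

for obj in (None, True, False):
    print(type(obj).__name__, sys.getrefcount(obj))
```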
Hey, just catching up here! Thanks for the ping @ericsnowcurrently. Following on from the above, there is indeed a link between the regression and the implementation of PEP 683. There are benchmarks that perform better and benchmarks that perform worse; for the full list, take a look here, which shows a ~1.02x regression on the geometric mean. One thing to add, though: the build here uses gcc/g++ 9.4, which I empirically found to perform slightly worse with PEP 683. Instead, I'd recommend trying this out again with GCC 11.4 (or LLVM 15+), which seems to fare much better with PEP 683 than GCC 9.4. Sorry for the delay in replies, and happy to answer any follow-up questions 🙂
Some time has gone by and my code has evolved somewhat, but here's a comparison of GCC compilers for a shorter version of the test profiled at the beginning of this issue. Again, the build is on Ubuntu 20.04, and CPython, NumPy 1.26.0, SciPy 1.11.2, HDF5 1.14.4-3, and H5Py 3.11 are all built from source using the indicated compiler; CPython is built with the same configure options as before. I have reset the timing reference to make Python 3.11.9+ the baseline for the comparison. The test case is a linear simulation of a circuit of resistors with 10,000+ unknowns and a lot of subcircuit hierarchy. A heavy burden is placed on networkx DiGraph() operations; networkx is written in Python and makes extensive use of dictionaries.
We again see the slowdown moving from Python 3.11 to Python 3.12 with GCC 9.4. However, performance drops further when switching to GCC 11.4, a result not consistent with the findings of @eduardo-elizondo and @ericsnowcurrently. Strangely, performance is degraded further still when switching to GCC 13.1.
Benchmarking is tricky, and which scenarios are considered will shape the results. I believe @eduardo-elizondo and @ericsnowcurrently when they say they didn't see significant regressions in their testing. But from a higher level, there must be a substantial number of usage and CPU-microarchitecture combinations where adding a branch instruction to every object refcount check does cause perf regressions. The CPU pipeline can't always hide the extra work of the additional instruction, and BTBs don't yield perfect zero-latency results. Modern designs tend to be capable of zero-bubble branch prediction, but there are many CPUs out in the wild where that isn't the case and any branch will cost a couple of cycles of latency, which may or may not be hidable by the pipeline. And even big CPU microarchitectures have limited BTB capacity; given enough branches in play, at some point the pipeline will stall because it has to resteer. Overall, I wonder whether the rather niche use case enabled by PEP 683 is worth the widespread perf regressions it causes.
@chrisgmorton Thanks for the follow-up on the benchmarks, quite interesting! In particular, I am interested in the following: is the slowdown moving from GCC 9 to 11 and 13 also present for Python 3.11.9 (i.e. independent of the PEP 683 work)? If so, we can create a separate issue for this, since it would be a roughly 5% slowdown when moving to a more recent compiler.
@eendebakpt, here's the analysis for different compilers with Python 3.11. I reran the GCC 9.4.0 case, so we have runs done with the computer in approximately the same state. For the record, I'm using a Dell laptop with an Intel Core i9-10885H CPU (16 logical processors, 8 cores) and 64GB RAM.
Clearly, the slowdown by compiler is specific to Python 3.12. As we have both found, the slowdown in Python 3.12 compared to Python 3.11 reported here is not exclusively caused by the Immortal Objects PEP 683; other check-ins between a7 and b1 of Python 3.12 contribute to it. The further slowdown in Python 3.12 with later GCC versions may or may not be connected to the root cause discussed at the core of this issue report.
TLDR: unrelated issue with the standard library, moved to #128641.
On version 3.13 this problem became even more noticeable. I have a function that parses hundreds of small files using ConfigParser. I made 3 versions of the program and measured their performance using timeit:
The performance regression is 2-6x between 3.11 and 3.13. I'm attaching the files of the original routine and the threaded version. The code is for Linux and BSD and provides a method for webbrowser to find installed browsers.
@2trvl Is the slowdown coming from reading the files with ConfigParser itself, or from gathering the paths with glob?
@eendebakpt I exported all paths using glob to a list, then copy-pasted it into a new module and started reading it with ConfigParser. Between 3.11/3.12 and 3.13 the time difference is again 2x, even on a small number of files:
$ python3.11 -m timeit -s "import nixconfig" "nixconfig.main()"
50 loops, best of 5: 5.27 msec per loop
$ python3.12 -m timeit -s "import nixconfig" "nixconfig.main()"
50 loops, best of 5: 5.33 msec per loop
$ python3.13 -m timeit -s "import nixconfig" "nixconfig.main()"
20 loops, best of 5: 11.2 msec per loop
I also made sure that
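The nixconfig module itself isn't attached in this thread; as a rough illustration, a workload of that shape, reading many small INI-style files with ConfigParser, might look like the sketch below, with the glob pattern standing in for the real file list:

```python
# Hypothetical stand-in for nixconfig.main(): parse many small INI-style files
# with ConfigParser. The glob pattern is a placeholder for the real file list.
import configparser
import glob

PATHS = glob.glob("/usr/share/applications/*.desktop")

def main():
    sections = 0
    for path in PATHS:
        parser = configparser.ConfigParser(interpolation=None)
        try:
            parser.read(path, encoding="utf-8")
        except configparser.Error:
            continue                   # skip files ConfigParser can't handle
        sections += len(parser.sections())
    return sections

if __name__ == "__main__":
    print(main())
```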
@2trvl Thanks. Could you share the files?
Thanks. If I understand correctly, the slowdown is in the ConfigParser reading itself?
@eendebakpt You can generate them on Linux using nixexport. Yes, I'll start profiling ConfigParser tomorrow and try to find the specific statements that are causing the slowdown.
UPD: I found out that it is not an interpreter problem but the following configparser change: 019143f. Sorry for bothering you, I will create a separate issue.
Ok. Make sure you create a minimal reproducer before opening the issue. Closing as not-a-bug.
@erlend-aasland, can you explain why this issue has been closed? The performance degradation of CPython post-3.11 is quite worrisome, especially given the focus on perf improvements.
@chrisgmorton, I misinterpreted the digression started with #109049 (comment) as the core issue; my bad. Reopening; thanks for the heads-up.
Bug report
Bug description:
Comparing Python 3.11.4 with 3.12.0rc1, I see a substantial slowdown of dictionary operations (update, copy, items, get). The build is from source on Ubuntu 20.04 using gcc/g++ 9.4.0, with configure options --enable-shared and --enable-optimizations. The NumPy version in both cases is 1.25.2, linked against MKL 2023.2.0.
Profiling results for my application are as follows:
Python 3.12.0rc1:
Python 3.11.4:
The slowdowns in the networkx DiGraph class methods add_edges_from(), in particular, and add_nodes_from() are likely caused by the performance degradation of the Python dictionary methods.
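As a rough way to check that attribution, one could time the named DiGraph methods directly on synthetic data; the graph below is purely illustrative, not the circuit model from the report:

```python
# Illustrative timing of the networkx methods named above on synthetic data.
import timeit
import networkx as nx

nodes = range(20_000)
edges = [(i, (i * 7 + 1) % 20_000) for i in range(100_000)]

def build():
    g = nx.DiGraph()
    g.add_nodes_from(nodes)
    g.add_edges_from(edges)
    return g

best = min(timeit.repeat(build, number=5, repeat=3))
print(f"best of 3: {best:.2f} s for 5 graph builds")
```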
CPython versions tested on:
3.12
Operating systems tested on:
Linux