Substantial Performance Regression of Dict operations in Python 3.12.0rc1 versus Python 3.11.4 #109049

Open
chrisgmorton opened this issue Sep 7, 2023 · 29 comments
Labels
3.12 (bugs and security fixes), 3.13 (bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-subinterpreters, type-bug (An unexpected behavior, bug, or error)

Comments

@chrisgmorton

chrisgmorton commented Sep 7, 2023

Bug report

Bug description:

Comparing Python 3.11.4 with 3.12.0rc1, I see a substantial slowdown in dictionary operations (update, copy, items, get). Both builds are from source on Ubuntu 20.04 using gcc/g++ 9.4.0, with the configure options --enable-shared and --enable-optimizations. The NumPy version in both cases is 1.25.2, linked against MKL 2023.2.0.

Profiling results for my application are as follows:
Python 3.12.0rc1:

Wed Sep  6 18:55:23 2023    profile.binfile

         27271926 function calls (26072328 primitive calls) in 20.862 seconds

   Ordered by: internal time
   List reduced from 520 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7045    4.000    0.001    5.031    0.001 digraph.py:469(add_nodes_from)
     7045    3.003    0.000    5.198    0.001 digraph.py:713(add_edges_from)
  7056013    1.952    0.000    2.093    0.000 {method 'update' of 'dict' objects}
     1765    1.304    0.001    1.351    0.001 core.py:7090(concatenate)
  1232480    0.752    0.000    1.121    0.000 reportviews.py:788(<genexpr>)
    91026    0.699    0.000    1.069    0.000 resolver.py:92(Define)
   203489    0.645    0.000    1.167    0.000 core.py:2952(_update_from)
    41601    0.568    0.000    0.581    0.000 device.py:157(Default)
    41601    0.479    0.000    0.917    0.000 mnaindexer.py:238(SetIndexing)
        2    0.405    0.203    0.469    0.234 mnaloader.py:95(SetLinearTerms)
  1495228    0.326    0.000    0.326    0.000 {method 'copy' of 'dict' objects}
    83202    0.320    0.000    0.558    0.000 device.py:40(SetSocket)
  1496872    0.311    0.000    0.311    0.000 {method 'items' of 'dict' objects}
  1626268    0.308    0.000    0.308    0.000 {method 'get' of 'dict' objects}

Python 3.11.4:

Wed Sep  6 18:54:04 2023    profile.binfile

         27569104 function calls (26369506 primitive calls) in 16.836 seconds

   Ordered by: internal time
   List reduced from 541 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7045    4.409    0.001    4.802    0.001 digraph.py:469(add_nodes_from)
     7045    1.576    0.000    2.972    0.000 digraph.py:713(add_edges_from)
     1765    1.337    0.001    1.380    0.001 core.py:7090(concatenate)
  1232480    0.767    0.000    0.961    0.000 reportviews.py:788(<genexpr>)
  7056013    0.765    0.000    0.881    0.000 {method 'update' of 'dict' objects}
    91026    0.520    0.000    0.787    0.000 resolver.py:92(Define)
   203489    0.486    0.000    0.721    0.000 core.py:2952(_update_from)
    41601    0.464    0.000    0.464    0.000 device.py:37(<dictcomp>)
    41601    0.369    0.000    0.704    0.000 mnaindexer.py:238(SetIndexing)
        2    0.331    0.165    0.413    0.206 mnaloader.py:95(SetLinearTerms)
    83202    0.312    0.000    0.402    0.000 device.py:40(SetSocket)
    93800    0.270    0.000    1.559    0.000 core.py:3217(__getitem__)
    43368    0.206    0.000    1.698    0.000 resolver.py:233(Connect)
1332528/332816    0.186    0.000    0.187    0.000 cell.py:20(GetDevices)
   267312    0.182    0.000    0.182    0.000 {built-in method numpy.array}
   104392    0.178    0.000    0.722    0.000 core.py:2978(__array_finalize__)
        1    0.148    0.148    0.921    0.921 mnamethod.py:276(SetLinearDeviceRulesIndexing)
   351240    0.129    0.000    0.839    0.000 {function MaskedArray.view at 0x7fc262658540}
  1227200    0.112    0.000    0.112    0.000 reportviews.py:774(<lambda>)
   1765/1    0.110    0.000   12.163   12.163 resolver.py:350(Build)
  1663235    0.109    0.000    0.109    0.000 {built-in method builtins.getattr}
    41601    0.105    0.000    0.162    0.000 mnaindexer.py:268(SetMatrixConstructionIndexing)
  1496872    0.101    0.000    0.101    0.000 {method 'items' of 'dict' objects}
  1643874    0.099    0.000    0.099    0.000 {method 'get' of 'dict' objects}
  5292/12    0.098    0.000    0.743    0.062 cell.py:275(SetParameters)
  1495228    0.098    0.000    0.098    0.000 {method 'copy' of 'dict' objects}

The slowdowns in the networkx DiGraph methods add_edges_from() (in particular) and add_nodes_from() are likely caused by the performance degradation of the Python dictionary methods.

CPython versions tested on:

3.12

Operating systems tested on:

Linux

@chrisgmorton chrisgmorton added the type-bug (An unexpected behavior, bug, or error) label Sep 7, 2023
@gaogaotiantian
Member

gaogaotiantian commented Sep 7, 2023

Sorry but I don't think this proves anything.

There could be a LOT of reasons why a complicated function slows down on a different interpreter. If this is indeed a performance regression in dict, you should be able to reproduce the result with ONLY dictionary operations. As it stands, the issue is more of a speculation.

In fact, cProfile itself changed between 3.11 and 3.12, and it's a deterministic profiler, which can introduce significant overhead if you have many small function calls.

Like I said, if you believe this is a dictionary performance regression, you should be able to isolate the dict operation, do it in a loop and stop-watch it - it should show the regression. Otherwise, I don't think this is worth investigating from CPython's perspective.
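
A minimal sketch of that kind of isolated stop-watch test (using timeit here; the exact script is of course up to the reporter) could look like this:

import timeit

# Run the same loop under both interpreters and compare the best of five runs.
times = timeit.repeat(
    stmt="d = {}\nfor i in range(1_000_000): d.update({str(i): i})",
    number=1,
    repeat=5,
)
print("dict.update(): best of 5 = {0:.3f}s".format(min(times)))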

@chrisgmorton
Author

chrisgmorton commented Sep 10, 2023

import time
from array import array
import sys

print("\nPython", sys.version.replace("\n",""), "\n")

a = array('d')
start = time.time()
for i in range(100000000): a.append(i)
end = time.time()

print('\tarray.append(): {0:3.1f}s'.format(end-start))

l = []
start = time.time()
for i in range(200000000): l.append(i)
end = time.time()

print('\tlist.append():  {0:3.1f}s'.format(end-start))

d = {}
start = time.time()
for i in range(25000000): d.update({str(i):i})
end = time.time()

print('\tdict.update():  {0:3.1f}s'.format(end-start))

s = set()
start = time.time()
for i in range(160000000): s.add(i)
end = time.time()

print('\tset.add():      {0:3.1f}s'.format(end-start))

Five manual, consecutive runs with Python 3.11:

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 9.4s
        list.append():  10.0s
        dict.update():  11.8s
        set.add():      9.2s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 10.6s
        list.append():  9.4s
        dict.update():  12.2s
        set.add():      10.2s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 10.4s
        list.append():  9.0s
        dict.update():  11.9s
        set.add():      10.0s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 9.7s
        list.append():  8.7s
        dict.update():  12.0s
        set.add():      9.2s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 9.5s
        list.append():  9.4s
        dict.update():  12.2s
        set.add():      9.6s

Five manual, consecutive runs with Python 3.12:

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 11.5s
        list.append():  10.4s
        dict.update():  13.1s
        set.add():      10.3s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 10.9s
        list.append():  11.9s
        dict.update():  12.7s
        set.add():      11.1s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 11.7s
        list.append():  10.8s
        dict.update():  13.3s
        set.add():      10.5s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 11.4s
        list.append():  11.2s
        dict.update():  12.9s
        set.add():      10.7s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 10.8s
        list.append():  9.6s
        dict.update():  12.8s
        set.add():      11.2s

I would say the results are statistically significant, and the slowdown in Python 3.12 versus Python 3.11 is evident here. It seems plausible that the root cause is related to memory management rather than being specific to the dict built-in, as I had originally thought: this test case consumes 20GB+ of memory.
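
One way to check the memory-management angle (a sketch, not part of the runs above) would be to record peak RSS alongside the timing of the dict loop; note that ru_maxrss is reported in kilobytes on Linux:

import resource
import time

d = {}
start = time.time()
for i in range(25000000):
    d.update({str(i): i})
end = time.time()

# Peak resident set size of the process (kilobytes on Linux, bytes on macOS).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('\tdict.update():  {0:3.1f}s, peak RSS {1:.1f} GB'.format(end - start, peak_kb / 1e6))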

@chrisgmorton
Author

chrisgmorton commented Sep 10, 2023

I should check the performance of the other dict methods I referenced in the original submission as there may be multiple issues here. I'll wait for feedback first on these tests.
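
For reference, a sketch of what checking the other dict methods from the original profile (copy, items, get) might look like, in the same stop-watch style as above:

import time

d = {str(i): i for i in range(1_000_000)}

start = time.time()
for _ in range(100):
    c = d.copy()
end = time.time()
print('\tdict.copy():  {0:3.1f}s'.format(end - start))

start = time.time()
for _ in range(100):
    for k, v in d.items():
        pass
end = time.time()
print('\tdict.items(): {0:3.1f}s'.format(end - start))

start = time.time()
for i in range(1_000_000):
    v = d.get(str(i))
end = time.time()
print('\tdict.get():   {0:3.1f}s'.format(end - start))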

@gaogaotiantian
Member

Yes, this is an observable slowdown. What's the result when the loop is smaller? Or when the memory consumption is smaller (same loop count, but without accumulating that much memory)?

It's not a common pattern to accumulate very small pieces of memory up to a very large total (not saying the slowdown is meaningless). It could be a tradeoff that makes the common pattern faster.

It could be a memory-management-related reason, but it could also be something else. I'm not sure who the expert in this area is, but it would be nice to have more information (the tests I mentioned above) so we can narrow down the cause.

@chrisgmorton
Author

chrisgmorton commented Sep 11, 2023

Here's a quick response for the case where the loop is 100 times smaller for each datatype (again, a slowdown is observed):

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 95.3ms
        list.append():  77.9ms
        dict.update():  74.0ms
        set.add():      99.2ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 97.6ms
        list.append():  82.8ms
        dict.update():  83.3ms
        set.add():      105.2ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 112.2ms
        list.append():  86.6ms
        dict.update():  75.9ms
        set.add():      104.6ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 95.3ms
        list.append():  82.1ms
        dict.update():  72.0ms
        set.add():      99.8ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 98.5ms
        list.append():  79.1ms
        dict.update():  74.8ms
        set.add():      108.0ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 112.6ms
        list.append():  93.8ms
        dict.update():  82.1ms
        set.add():      112.5ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 112.6ms
        list.append():  105.6ms
        dict.update():  82.7ms
        set.add():      115.7ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 109.3ms
        list.append():  90.1ms
        dict.update():  83.5ms
        set.add():      117.1ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 108.1ms
        list.append():  90.6ms
        dict.update():  80.6ms
        set.add():      106.5ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 110.5ms
        list.append():  91.9ms
        dict.update():  84.2ms
        set.add():      110.8ms

@chrisgmorton
Author

chrisgmorton commented Sep 11, 2023

Here's the original test case but with immediate removal of the data entry inside the loop, either with pop() or del d[key]. In this case there is no growth in memory usage (as expected). It's unclear to me that this tells us much more. The previous tests would suggest we simply have a per-operation slowdown.
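
(The modified dict portion of the script presumably looked something like this sketch; the exact script for this run wasn't posted.)

import sys
import time

print("\nPython", sys.version.replace("\n", ""), "\n")

d = {}
start = time.time()
for i in range(25000000):
    d.update({str(i): i})
    d.pop(str(i))  # or: del d[str(i)] -- the dict never grows
end = time.time()

print('\tdict.update():  {0:3.1f}s'.format(end - start))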

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.3s
        list.append():  15.7s
        dict.update():   9.7s
        set.add():      10.7s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.9s
        list.append():  16.8s
        dict.update():   9.9s
        set.add():      12.1s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.3s
        list.append():  15.2s
        dict.update():   9.6s
        set.add():      11.9s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.9s
        list.append():  15.1s
        dict.update():   9.6s
        set.add():      12.1s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.3s
        list.append():  14.8s
        dict.update():   9.4s
        set.add():      12.5s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 14.9s
        list.append():  16.8s
        dict.update():  11.6s
        set.add():      12.2s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 15.4s
        list.append():  17.6s
        dict.update():  12.1s
        set.add():      13.3s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 14.3s
        list.append():  16.1s
        dict.update():  11.7s
        set.add():      14.1s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 16.2s
        list.append():  16.2s
        dict.update():  12.1s
        set.add():      12.9s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 15.4s
        list.append():  19.3s
        dict.update():  11.7s
        set.add():      15.3s

@chrisgmorton
Author

The performance regression seems to have been introduced between 3.12.0a7 and 3.12.0b1. I'm not sure it's a single commit; it may be an accumulation of a number of commits, possibly in the area of the sub-interpreter work.

@eendebakpt
Contributor

I can confirm there is some kind of regression. Performing this benchmark:

import pyperf

runner = pyperf.Runner()

setup = """
import copy

a = {'list': [1, 2, 3, 43], 't': (1, 2, 3), 'str': 'hello', 'subdict': {'a': True}}
"""

runner.timeit(name="deepcopy dict", stmt="b = copy.deepcopy(a)", setup=setup)

results in

Mean +- std dev: [3114] 7.18 us +- 0.09 us -> [main] 7.91 us +- 0.22 us: 1.10x slower

This is for comparison of 3.11.4 with current main.

@eendebakpt
Contributor

From speed.python.org:

[speed.python.org benchmark plot showing the regression]

The performance regression seems to occur around the merge of #19474, which is known to reduce performance a bit.

@chrisgmorton
Author

For the bigger picture: my application belongs to a class of scientific computing applications that numerically solve coupled non-linear PDEs, making extensive use of NumPy, SciPy, H5Py, and sparse-matrix linear algebra.

The 3.11.4 and 3.12.0rc2 installations use identical Python module versions (NumPy, SciPy, H5Py/HDF5 built from source for each Python version).

These are my performance test results, benchmarked against Python 3.8:

Python 3.11.4:

 > Testing circperformance

Test 0: Time is 3.20771s, reference is 4.10, performance is 1.28x or 27.8% speed-up. PASS
Test 1: Time is 1.11444s, reference is 1.40, performance is 1.26x or 25.6% speed-up. PASS
Test 2: Time is 1.45542s, reference is 1.70, performance is 1.17x or 16.8% speed-up. PASS
Test 3: Time is 2.01886s, reference is 2.40, performance is 1.19x or 18.9% speed-up. PASS

100% pass: 4 Tests, failures[], skipped[]

Python 3.12.0rc2:

 > Testing circperformance

Test 0: Time is 3.78364s, reference is 4.10, performance is 1.08x or 8.4% speed-up. PASS
Test 1: Time is 1.30262s, reference is 1.40, performance is 1.07x or 7.5% speed-up. PASS
Test 2: Time is 1.61533s, reference is 1.70, performance is 1.05x or 5.2% speed-up. PASS
Test 3: Time is 2.41445s, reference is 2.40, performance is 0.99x or 0.6% slow-down. FAIL

75% pass: 4 Tests, failures[3], skipped[]

Python 3.11.4:

 > Testing devperformance

Test 0: Time is 1.64307s, reference is 1.85, performance is 1.13x or 12.6% speed-up. PASS
Test 1: Time is 2.28619s, reference is 2.60, performance is 1.14x or 13.7% speed-up. PASS
Test 2: Time is 1.62358s, reference is 1.90, performance is 1.17x or 17.0% speed-up. PASS
Test 3: Time is 1.75078s, reference is 2.00, performance is 1.14x or 14.2% speed-up. PASS
Test 4: Time is 2.06646s, reference is 2.20, performance is 1.06x or 6.5% speed-up. PASS
Test 5: Time is 1.57825s, reference is 1.75, performance is 1.11x or 10.9% speed-up. PASS
Test 6: Time is 2.35982s, reference is 2.80, performance is 1.19x or 18.7% speed-up. PASS
Test 7: Time is 2.03836s, reference is 2.20, performance is 1.08x or 7.9% speed-up. PASS
Test 8: Time is 4.01285s, reference is 4.50, performance is 1.12x or 12.1% speed-up. PASS

100% pass: 9 Tests, failures[], skipped[]

Python 3.12.0rc2

> Testing devperformance

Test 0: Time is 1.85779s, reference is 1.85, performance is 1.00x or 0.4% slow-down. FAIL
Test 1: Time is 2.53469s, reference is 2.60, performance is 1.03x or 2.6% speed-up. PASS
Test 2: Time is 1.89236s, reference is 1.90, performance is 1.00x or 0.4% speed-up. PASS
Test 3: Time is 1.98677s, reference is 2.00, performance is 1.01x or 0.7% speed-up. PASS
Test 4: Time is 2.25508s, reference is 2.20, performance is 0.98x or 2.5% slow-down. FAIL
Test 5: Time is 1.77294s, reference is 1.75, performance is 0.99x or 1.3% slow-down. FAIL
Test 6: Time is 2.61213s, reference is 2.80, performance is 1.07x or 7.2% speed-up. PASS
Test 7: Time is 2.19335s, reference is 2.20, performance is 1.00x or 0.3% speed-up. PASS
Test 8: Time is 4.52551s, reference is 4.50, performance is 0.99x or 0.6% slow-down. FAIL

55% pass: 9 Tests, failures[0, 4, 5, 8], skipped[]

As expected, we see substantial performance improvements for Python 3.11 over Python 3.8. However, Python 3.12.0rc2 appears to give back much of those gains.

@eendebakpt
Contributor

eendebakpt commented Sep 22, 2023

I did some more tests with publicly available packages (sympy, lmfit and lark) and compared Python 3.10, 3.11 and 3.12. The results are similar to the OP's: a significant performance improvement from 3.10 to 3.11, but a regression from 3.11 to 3.12 for two out of the three test cases. For sympy the performance loss is about 10%.

3.10.8 (tags/v3.10.8-dirty:aaaf517424, May  9 2023, 23:40:53) [GCC 9.4.0]
sympy 0.35402853199775564
lmfit 0.3148488170008932
lark 0.3982844579986704
3.11.5 (main, Sep 22 2023, 16:56:56) [GCC 9.4.0]
sympy 0.2964389949993347
lmfit 0.281992337997508
lark 0.3138663429999724
3.12.0rc3 (main, Sep 22 2023, 21:36:02) [GCC 9.4.0]
sympy 0.32691953900211956
lmfit 0.29783570200015674
lark 0.2835550689997035
Full code for benchmarks

Source was downloaded from python.org and compiled with ./configure --enable-optimizations. Required packages have been installed with

python -m pip install sympy==1.12 numpy==1.26.0 scipy==1.11.2 lmfit==1.2.2 pyperformance==1.0.9 lark==1.1.7

Benchmark script

import sys
import time

import sympy
from lark import Lark, Transformer, v_args
from lmfit import Model
from numpy import exp, linspace, random

print(sys.version)


# %%


def gaussian(x, amp, cen, wid):
    return amp * exp(-(x-cen)**2 / wid)


def bench_sympy():
    x = sympy.symbols('x')

    for n in [1, 10, 20]:
        expr = x**n*sympy.exp(x)
        integral = sympy.Integral(expr, x)
        r = integral.doit()

        expr = sympy.log(sympy.Add(*[sympy.exp(i*x)
                         for i in range(n)])).diff(x)
        integral = sympy.Integral(expr, x)
        r = integral.doit()


# activate caches in sympy
bench_sympy()

t0 = time.perf_counter()
bench_sympy()
dt = time.perf_counter()-t0
print(f'sympy {dt}')

# %%


def bench_lmfit():

    x = linspace(-10, 10, 101)
    y = gaussian(x, 2.33, 0.21, 1.51) + random.normal(0, 0.2, x.size)

    gmodel = Model(gaussian)
    params = gmodel.make_params(cen=0.3, amp=3, wid=1.25)
    result = gmodel.fit(y, params, x=x)

    gmodel.set_param_hint('amp', min=2.34)
    result = gmodel.fit(y, params, x=x)


t0 = time.perf_counter()
for ii in range(40):
    bench_lmfit()
dt = time.perf_counter()-t0
print(f'lmfit {dt}')


# %%


json_grammar = r"""
    ?start: value

    ?value: object
          | array
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    array  : "[" [value ("," value)*] "]"
    object : "{" [pair ("," pair)*] "}"
    pair   : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS

    %ignore WS
"""


class TreeToJson(Transformer):
    @v_args(inline=True)
    def string(self, s):
        return s[1:-1].replace('\\"', '"')

    array = list
    pair = tuple
    object = dict
    number = v_args(inline=True)(float)

    def null(self, _): return None
    def true(self, _): return True
    def false(self, _): return False


def bench_lark():
    json_parser = Lark(json_grammar, parser='lalr',
                       lexer='basic',
                       propagate_positions=False,
                       maybe_placeholders=False,
                       transformer=TreeToJson())
    parse = json_parser.parse

    test_json = '''
        {
            "empty_object" : {},
            "empty_array"  : [],
            "booleans"     : { "YES" : true, "NO" : false },
            "numbers"      : [ 0, 1, -2, 3.3, 4.4e5, 6.6e-7 ],
            "strings"      : [ "This", [ "And" , "That", "And a \\"b" ] ],
            "nothing"      : null
        }
    '''

    j = parse(test_json)


t0 = time.perf_counter()
for ii in range(20):
    bench_lark()
dt = time.perf_counter()-t0
print(f'lark {dt}')

Update: I benchmarked sympy for ea2c001 (the immortal instances PR) and 916de04 (the commit before).
The performance regression is 5% (averaged over 10 runs: base 0.29103303970023264, immortal instances 0.30650054839989027). That means the other 5% of the performance regression is due to other commits between 3.11 and 3.12.

@nijel
Contributor

nijel commented Oct 12, 2023

I found this issue after noticing that our CI tests occasionally time out on GitHub with Python 3.12 while they worked fine on 3.11. We did see performance gains for several Python releases, but 3.12 seems to go back to the performance of 3.9 (at least CI-wise).

The slowdown can be seen in other projects as well. For example, some parts of the above-mentioned sympy test suite (which they run split into 4 parts) are 20% slower on Python 3.12 compared to 3.11, and the Django tests seem about 10% slower. I know that CI timing can vary a lot, but the regression seems to happen across projects, and this issue shows some synthetic benchmarks as well.

Links to CI runs

@Voultapher

Voultapher commented Oct 27, 2023

The user sahnehaeubchen linked this issue to PEP 683 in this discussion; so far that possible connection hasn't been talked about here. I don't know what they are basing the claim on, but it seems plausible to me that an additional branch on every object refcount check could have wide-ranging impacts. The PEP claims:

A naive implementation of the approach described below makes CPython roughly 4% slower. However, the implementation is performance-neutral once known mitigations are applied.

I could imagine that the benchmarks performed might not account for branch misprediction due to BTB misses, and more generally for larger code sizes that don't fit into the L1 i-cache and L0/L1 BTB, and other cache effects of larger programs that increase branch prediction cost.
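
As a side note on what PEP 683 changes at the Python-visible level (a quick illustration, assuming a 3.12 build): refcounts of immortal objects such as None no longer move, which is the observable side of the extra check on every incref/decref:

import sys

before = sys.getrefcount(None)
refs = [None] * 1000  # create many new references to None
after = sys.getrefcount(None)

# On 3.11 and earlier, `after` is roughly `before + 1000`.
# On 3.12+ (PEP 683), None is immortal and its refcount stays pinned at a
# sentinel value, so `before == after`.
print(before, after)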

@iritkatriel iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 28, 2023
@ericsnowcurrently
Member

CC @eduardo-elizondo

@eduardo-elizondo
Contributor

Hey, just catching up here! Thanks for the ping @ericsnowcurrently. Following on the above, there is indeed a link between this regression and the implementation of PEP 683. There are benchmarks that perform better and benchmarks that perform worse; for the full list, take a look here, which shows a ~1.02x regression on the geometric mean.

One thing to add, though: the build here uses gcc/g++ 9.4, which I empirically found to perform slightly worse with PEP 683. Instead, I'd recommend trying this out again with GCC 11.4 (or LLVM 15+), which seems to fare much better with PEP 683 than GCC 9.4.

Sorry for the delay in replies and happy to answer any follow-up questions 🙂

@chrisgmorton
Author

chrisgmorton commented Jun 27, 2024

Some time has gone by and my code has evolved somewhat, but here's a comparison of GCC compilers for a shorter version of the test profiled at the beginning of this issue. Again, the build is on Ubuntu 20.04, and CPython, NumPy 1.26.0, SciPy 1.11.2, HDF5 1.14.4-3, and H5Py 3.11 are all built from source using the indicated compiler. CPython is built with the --enable-shared --enable-optimizations --with-lto configure options.

I have reset the timing reference to make Python 3.11.9+ the baseline for the comparison. The test case is a linear simulation of a circuit of resistors with 10,000+ unknowns and a lot of subcircuit hierarchy. A heavy burden is placed on networkx DiGraph() operations; networkx is written in Python and uses dictionaries extensively.

Python 3.11.9+ (heads/3.11:1b0e63c81b, Jun 26 2024, 20:59:07) [GCC 9.4.0] on linux

Out:Time is 3.70357s, reference is 3.80, performance is 1.03x or 2.6% speed-up.
Out:Time is 3.70533s, reference is 3.80, performance is 1.03x or 2.6% speed-up.
Out:Time is 3.69513s, reference is 3.80, performance is 1.03x or 2.8% speed-up.
Out:Time is 3.67317s, reference is 3.80, performance is 1.03x or 3.5% speed-up.
Out:Time is 3.73866s, reference is 3.80, performance is 1.02x or 1.6% speed-up.

Python 3.12.4+ (heads/3.12:709ef004df, Jun 26 2024, 20:08:47) [GCC 9.4.0] on linux

Out:Time is 4.24421s, reference is 3.80, performance is 0.90x or 11.7% slow-down.
Out:Time is 4.31557s, reference is 3.80, performance is 0.88x or 13.6% slow-down.
Out:Time is 4.31514s, reference is 3.80, performance is 0.88x or 13.6% slow-down.
Out:Time is 4.27742s, reference is 3.80, performance is 0.89x or 12.6% slow-down.
Out:Time is 4.33357s, reference is 3.80, performance is 0.88x or 14.0% slow-down.

Python 3.12.4+ (heads/3.12:709ef004df, Jun 26 2024, 16:41:43) [GCC 11.4.0] on linux

Out:Time is 4.39564s, reference is 3.80, performance is 0.86x or 15.7% slow-down.
Out:Time is 4.38541s, reference is 3.80, performance is 0.87x or 15.4% slow-down.
Out:Time is 4.34700s, reference is 3.80, performance is 0.87x or 14.4% slow-down.
Out:Time is 4.40990s, reference is 3.80, performance is 0.86x or 16.0% slow-down.
Out:Time is 4.40531s, reference is 3.80, performance is 0.86x or 15.9% slow-down.

Python 3.12.4+ (heads/3.12:709ef004df, Jun 26 2024, 19:00:49) [GCC 13.1.0] on linux

Out:Time is 4.49208s, reference is 3.80, performance is 0.85x or 18.2% slow-down.
Out:Time is 4.44031s, reference is 3.80, performance is 0.86x or 16.9% slow-down.
Out:Time is 4.45659s, reference is 3.80, performance is 0.85x or 17.3% slow-down.
Out:Time is 4.37919s, reference is 3.80, performance is 0.87x or 15.2% slow-down.
Out:Time is 4.46576s, reference is 3.80, performance is 0.85x or 17.5% slow-down.

We again see the slowdown, moving from Python 3.11 to Python 3.12, with GCC 9.4. However, performance drops further when switching to GCC 11.4, a result not consistent with the findings of @eduardo-elizondo and @ericsnowcurrently. Strangely, performance is further degraded when switching to GCC 13.1.

@Voultapher

Benchmarking is tricky, and which scenarios are considered will shape the results. I believe @eduardo-elizondo and @ericsnowcurrently when they say they didn't see significant regressions in their testing. But at a higher level, there must be a substantial number of usage + CPU-microarchitecture combinations where adding a branch instruction to every object refcount check does cause perf regressions. The CPU pipeline can't always hide the extra work of the additional instruction, and BTBs don't yield perfect no-latency results. Modern designs tend to be capable of zero-bubble branch prediction, but there are many CPUs out in the wild where that isn't the case, and any branch will cost a couple of cycles of latency, which may or may not be hidden by the pipeline. And even big microarchitectures have limited BTB capacity; with enough branches in play, at some point the pipeline will stall because it has to re-steer.

Overall I wonder if the rather niche use case enabled by PEP 683 is worth the widespread perf regressions it causes.

@eendebakpt
Contributor

@chrisgmorton Thanks for the follow-up on the benchmarks, quite interesting! In particular I am interested in the following: is the slowdown when moving from GCC 9 to 11 and 13 also present for Python 3.11.9 (i.e., independent of the PEP 683 work)? If so, we can create a separate issue for that, since it would be a roughly 5% slowdown when moving to a more recent compiler.

@chrisgmorton
Author

chrisgmorton commented Jun 29, 2024

@eendebakpt, here's the analysis for different compilers with Python 3.11. I ran the GCC 9.4.0 results again, so we have runs done with the computer in approximately the same state. For the record, I'm using a Dell laptop with an Intel Core i9-10885H CPU (8 cores, 16 logical processors) and 64GB of RAM.

Python 3.11.9+ (heads/3.11:1b0e63c81b, Jun 26 2024, 20:59:07) [GCC 9.4.0] on linux

Out:Time is 3.66981s, reference is 3.80, performance is 1.04x or 3.5% speed-up.
Out:Time is 3.68831s, reference is 3.80, performance is 1.03x or 3.0% speed-up.
Out:Time is 3.63654s, reference is 3.80, performance is 1.04x or 4.5% speed-up.
Out:Time is 3.65363s, reference is 3.80, performance is 1.04x or 4.0% speed-up.
Out:Time is 3.70226s, reference is 3.80, performance is 1.03x or 2.6% speed-up.

Python 3.11.9+ (heads/3.11:1b0e63c81b, Jun 28 2024, 08:18:03) [GCC 11.4.0] on linux

Out:Time is 3.64905s, reference is 3.80, performance is 1.04x or 4.1% speed-up.
Out:Time is 3.64112s, reference is 3.80, performance is 1.04x or 4.4% speed-up.
Out:Time is 3.69406s, reference is 3.80, performance is 1.03x or 2.9% speed-up.
Out:Time is 3.68263s, reference is 3.80, performance is 1.03x or 3.2% speed-up.
Out:Time is 3.66976s, reference is 3.80, performance is 1.04x or 3.5% speed-up.

Python 3.11.9+ (heads/3.11:1b0e63c81b, Jun 28 2024, 09:03:55) [GCC 13.1.0] on linux

Out:Time is 3.64538s, reference is 3.80, performance is 1.04x or 4.2% speed-up.
Out:Time is 3.71220s, reference is 3.80, performance is 1.02x or 2.4% speed-up.
Out:Time is 3.71622s, reference is 3.80, performance is 1.02x or 2.3% speed-up.
Out:Time is 3.67533s, reference is 3.80, performance is 1.03x or 3.4% speed-up.
Out:Time is 3.72454s, reference is 3.80, performance is 1.02x or 2.0% speed-up.

Clearly, the compiler-dependent slowdown is specific to Python 3.12. As we have both found, the slowdown in Python 3.12 compared to Python 3.11 reported here is not exclusively caused by the immortal objects work of PEP 683; other check-ins between 3.12.0a7 and 3.12.0b1 contribute to it. The slowdown in Python 3.12 with later versions of GCC may or may not be connected to the root cause discussed in the core of this issue report.

@2trvl

2trvl commented Jan 2, 2025

TLDR: unrelated issue with standard library, moved to #128641

On version 3.13 this problem became even more noticeable.

I have a function that parses hundreds of small files using configparser. The ConfigParser object is built on MutableMapping and uses dicts in its attribute values.

I made 3 versions of the program:
(M) Multiprocessing
(O) Original Routine
(T) Threading

And measured their performance using timeit:

Linux arch 6.10.6-arch1-1
Parsing 186 files

Python 3.11.11
(M) 5 loops, best of 5: 62.4 msec per loop
(O) 2 loops, best of 5: 110 msec per loop
(T) 2 loops, best of 5: 130 msec per loop

Python 3.12.8
(M) 5 loops, best of 5: 66.5 msec per loop
(O) 2 loops, best of 5: 118 msec per loop
(T) 2 loops, best of 5: 140 msec per loop

Python 3.13.1
(M) 2 loops, best of 5: 125 msec per loop
(O) 1 loop, best of 5: 222 msec per loop
(T) 1 loop, best of 5: 248 msec per loop

Python 3.13.1 Free-threaded
(M) 1 loop, best of 5: 331 msec per loop
(O) 1 loop, best of 5: 648 msec per loop
(T) 1 loop, best of 5: 340 msec per loop

The performance regression is 2-6x between 3.11 and 3.13.

I'm attaching the files for the original routine and the threaded version. The code is for Linux and BSD and provides a method to find installed browsers for webbrowser.
nixbrowserO.py.txt
nixbrowserT.py.txt

@eendebakpt
Contributor

@2trvl Is find_unix_browsers the method you have benchmarked? The method does many different things, so it could very well be that the performance regression you see is not related to dict (which has a reported regression of only up to 10% so far) but to something else. Could you try to reduce the example to a simpler version (e.g. a script that only calls ConfigParser, but not shlex.split, shutil.which, etc.)?
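
A minimal reproducer along those lines (a sketch with hypothetical file paths; the real list would come from the glob output) might look like:

# nixconfig.py -- only ConfigParser, no shlex/shutil/glob
from configparser import ConfigParser

SHORTCUTS = [
    "/usr/share/applications/firefox.desktop",
    "/usr/share/applications/python3.13.desktop",
    # ... full list of .desktop paths exported with glob
]

def main():
    for shortcut in SHORTCUTS:
        parser = ConfigParser()
        parser.read(shortcut, encoding="utf-8")

if __name__ == "__main__":
    main()

It could then be timed under each interpreter with something like python -m timeit -s "import nixconfig" "nixconfig.main()".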

@2trvl

2trvl commented Jan 2, 2025

@eendebakpt I exported all paths to a list using glob, then copy-pasted it into a new module and read the files with ConfigParser. Between 3.11/3.12 and 3.13 the time difference is again 2x, even for a small number of files.

$ python3.11 -m timeit -s "import nixconfig" "nixconfig.main()"
50 loops, best of 5: 5.27 msec per loop
$ python3.12 -m timeit -s "import nixconfig" "nixconfig.main()"
50 loops, best of 5: 5.33 msec per loop
$ python3.13 -m timeit -s "import nixconfig" "nixconfig.main()"
20 loops, best of 5: 11.2 msec per loop

I also made sure that shlex.split, shutil.which, glob.iglob do not affect performance.

@eendebakpt
Contributor

@2trvl Thanks. Could you share the nixconfig?

@2trvl

2trvl commented Jan 2, 2025

@eendebakpt
Contributor

Thanks, the nixconfig.py is clear now. I do not have the files that are parsed (e.g. '/usr/share/applications/python3.13.desktop', '/usr/share/applications/python3.10.desktop', ...). Could you share a few of them?

If I understand correctly, the parser.read(shortcut, encoding="utf-8") part of your script is what is taking the time and where you see the regression. Correct?

@2trvl

2trvl commented Jan 2, 2025

@eendebakpt You can generate them on Linux using nixexport.

Yes, ConfigParser.read() causes the regression. Nixconfig opens the files one by one in a for loop. Alternatively, you can pass them all to read() at once, overwriting the contents of the parser, but the execution time does not change.

I'll start profiling ConfigParser tomorrow and try to find specific statements that are causing the slowdown.
shortcuts.zip

UPD:

I found out that it is not an interpreter problem but the configparser change 019143f. Sorry for bothering you; I will create a separate issue.

@erlend-aasland
Contributor

I found out that it is not an interpreter problem but the configparser change 019143f. Sorry for bothering you; I will create a separate issue.

Ok. Make sure you create a minimal reproducer before opening the issue. Closing as not-a-bug.

@erlend-aasland erlend-aasland closed this as not planned Won't fix, can't repro, duplicate, stale Jan 3, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in Subinterpreters Jan 3, 2025
@chrisgmorton
Author

@erlend-aasland, can you explain why this issue has been closed? The performance degradation of CPython post-3.11 is quite worrisome, especially given the focus on perf improvements.

@erlend-aasland
Contributor

@chrisgmorton, I mistook the digression that started with #109049 (comment) for the core issue; my bad. Reopening; thanks for the heads-up.

@erlend-aasland erlend-aasland reopened this Jan 3, 2025
Projects
Status: Done
Development

No branches or pull requests