Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random lock-ups when running tests/test_api_base.py::TestCreateCopy::test #45

Open
josch opened this issue Jan 21, 2025 · 0 comments
Open

Comments

@josch
Copy link

josch commented Jan 21, 2025

Hi,

I have sunk a few days of my time into this and was unable to come up with a solution. I'm documenting it here in case anybody else wants to pick this up where I left of.

The problem is, that on the Debian buildds, on the Debian gitlab CI as well as on tests.reproducible-builds.org, the bmaptool test suite runs into a dead-lock around ~50% of the time it is run. I am unable to reproduce this problem on my own laptop, so debugging this was not trivial. I hacked in a periodic call to pstree -l -a to have a look at what the process tree looks like when the deadlock happens:

  • bzip2 stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
        |   `-pstree -l -a 1641
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-sh -c bzip2 -c -k  /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_vq64huef.img
  • xz stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 13086 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 13086 >&2; done
        |   `-pstree -l -a 13086
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-sh -c xz -c -k  /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_kqm3eugf.img
  • tar stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 13126 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 13126 >&2; done
        |   `-pstree -l -a 13126
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-sh -c tar -c -j -O -P -C /  /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_gn6_77ax.img
  • gzip stalls:
  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 1640 >&2; done' & pid=$!; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- -vvv;  \\\012kill $pid; \\\012wait $pid || :
        |-sh -c while sleep 10; do pstree -l -a 1640 >&2; done
        |   `-pstree -l -a 1640
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- -vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 -vvv
                        `-sh -c cd /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest tests
                            `-python3.13 -m pytest tests
                                `-sh -c gzip -c /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_4up2jzoj.img

All of these tools are invoked by _generate_compressed_files() in tests/test_api_base.py so my guess was that the way that the tools are called using subprocess.Popen(), followed by Popen.wait() somehow causes the kind of deadlock that is described in the documentation of Popen.wait(). So I replaced those subprocess calls with calls to subprocess.check_output() and slurped the whole standard output into a Python string (it's only ~2 MB of data). But this did not change the behaviour at all.

Searching online for deadlocks involving the subprocess module revealed that in Python 2.7 there was a problem of deadlocks when the subprocess module was called from Python threads. The fix was supposedly to add close_fds=True to the Popen call. I tried that too but it resulted no change in behaviour.

Then I stumbled across a slightly different deadlock:

  rules -f debian/rules override_dh_auto_test
    `-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test  -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
        |-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
        |   `-pstree -l -a 1641
        `-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
            `-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
                `-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
                    `-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
                        `-sh -c cd /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
                            `-python3.13 -m pytest -vvv
                                `-df -PT -- /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_3wg1y132.img

The invocation of "df -PT" is from a completely different part of the code. It comes from get_file_system_type() in bmaptool/BmapHelpers.py which just runs:

       proc = subprocess.Popen(["df", "-PT", "--", abspath], stdout=PIPE, stderr=PIPE)
       stdout, stderr = proc.communicate()

But this should be completely harmless, no???

So maybe the problem is not with tests/test_api_base.py but the problem is a more fundamental one. I looked into other parts of the codebase and _open_compressed_file() in bmaptool/TransRead.py does an aweful lot of threading in combination with opening decompressors via the subprocess module.

I have no solution to the problem yet but I do have a workaround:

--- a/tests/helpers.py
+++ b/tests/helpers.py
@@ -284,29 +284,19 @@ def calculate_chksum(file_path):
     return hash_obj.hexdigest()
█
█
+import subprocess
+
+
 def copy_and_verify_image(image, dest, bmap, image_chksum, image_size):
     """
     Copy image 'image' using bmap file 'bmap' to the destination file 'dest'
     and verify the resulting image checksum.
     """
█
-    f_image = TransRead.TransRead(image)
-    f_dest = open(dest, "w+b")
-    if bmap:
-        f_bmap = open(bmap, "r")
-    else:
-        f_bmap = None
-
-    writer = BmapCopy.BmapCopy(f_image, f_dest, f_bmap, image_size)
-    # Randomly decide whether we want the progress bar or not
-    if bool(random.getrandbits(1)) and sys.stdout.isatty():
-        writer.set_progress_indicator(sys.stdout, None)
-    writer.copy(bool(random.getrandbits(1)), bool(random.getrandbits(1)))
+    subprocess.check_call(
+        ["python3", "-m", "bmaptool", "copy"]
+        + (["--bmap", bmap] if bmap else ["--nobmap"])
+        + [image, dest]
+    )-    # Compare the original file and the copy are identical
     assert calculate_chksum(dest) == image_chksum
-
-    if f_bmap:
-        f_bmap.close()
-    f_dest.close()
-    f_image.close()

Instead of calling BmapCopy.BmapCopy().copy(), spawn a new process running the
full bmaptool utility with the given options. This is slower than the existing
code because bmaptool gets spawned a lot but it will effectively test the same
thing. Except that with this change we do not anymore test using BmapCopy as a
library. But I do not believe that this test is useful for Debian because we do
not ship this as a library but we ship it as a program that is to be called
from the terminal. And when run like that, the deadlock problem never occurs.

I also documented this in the associated Debian bug: https://bugs.debian.org/1081336

Maybe somebody else feels motivated to look into this. It would probably help being able to reproduce this issue locally instead of a remote machine without ssh access. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant